Pig, mail # user - Filtering for a numeric value


RE: Filtering for a numeric value
Santhosh Srinivasan 2009-02-08, 00:30
To use types, you should probably build pig.jar from trunk. The
released version of Pig does not support types. Look out for a
release with type support soon.

Santhosh

-----Original Message-----
From: Gregory Harman [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 07, 2009 12:53 PM
To: [EMAIL PROTECTED]
Subject: Filtering for a numeric value

Hi all,

I'm just getting started with Pig, and am having a problem that I'm  
hoping is some standard rookie mistake. I have the following data in a  
file "t.csv":

59001000,FOO,6/29/08,22,23,BAR
59001001,FOO,6/29/08,23,24,BAR

I want to read in this file, ensuring that the first column is numeric  
(so that it won't break a filter on that column with a numeric  
operator such as > later), so I first tried to cast it in a schema  
like so:

raw = LOAD 't.csv' USING PigStorage (',') as (a:long, b:chararray, c:chararray, d:int, e:int, f:chararray);

but I get the following stack trace:

2009-02-07 15:45:47,011 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.io.IOException: Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

at org.apache.pig.PigServer.registerQuery(PigServer.java:278)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:475)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:81)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:294)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:4885)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:4762)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.SchemaTuple(QueryParser.java:2567)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:660)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:512)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:362)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:47)
at org.apache.pig.PigServer.registerQuery(PigServer.java:275)
... 5 more

2009-02-07 15:45:47,012 [main] ERROR org.apache.pig.tools.grunt.Grunt - Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

I next tried loading it in with no type casting and instead filtering  
based on a numeric-only regexp:

raw = LOAD 't.csv' USING PigStorage (',') as (a, b, c, d, e, f);
clean = FILTER raw BY ($0 matches '\d+'); /* \d+ should be Java regexp for one or more digits */

but this resulted in:

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 37.  Encountered: "d" (100), after : "\'\\"
at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1623)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:4771)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_91(QueryParser.java:4439)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_72(QueryParser.java:4492)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_54(QueryParser.java:4536)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_39(QueryParser.java:4590)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_27(QueryParser.java:4606)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:4369)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:3820)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PUnaryCond(QueryParser.java:1105)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PAndCond(QueryParser.java:1055)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.POrCond(QueryParser.java:1005)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PCond(QueryParser.java:973)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.FilterClause(QueryParser.java:941)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:686)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:512)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:362)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:47)
at org.apache.pig.PigServer.registerQuery(PigServer.java:275)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:475)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:81)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:294)

I'm using Pig 0.1.1 in local mode.

I'm interested in knowing if there's a better way to accomplish my  
goal (filter on a value range for the first column), but I'd also like  
to know why these two breakages happened, as both should be valid as  
far as I understand the docs...

thanks,
Greg
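
For reference, the intended logic (keep only rows whose first column is all digits, then filter on a numeric range) can be sketched in plain Java. This is only an illustration of the goal, not Pig code and not a fix for either parse error. The class NumericFilterSketch, its methods, and the extra "oops" row are hypothetical names and data made up for this sketch. It uses the character class [0-9]+ instead of \d+, since both match runs of ASCII digits in Java's default regex mode and the former needs no backslash. Whether Pig 0.1.1's matches operator accepts that pattern is not confirmed here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Pig: mirrors "digits only, then range filter".
final class NumericFilterSketch {
    // [0-9]+ matches one or more ASCII digits and avoids any backslash escaping.
    private static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // True only if the whole field consists of digits.
    static boolean isAllDigits(String field) {
        return field != null && DIGITS.matcher(field).matches();
    }

    // Keeps comma-separated lines whose first column is numeric and within [min, max].
    static List<String> filterByFirstColumn(List<String> lines, long min, long max) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String first = line.split(",", 2)[0];
            // Skip non-numeric values; the length cap keeps Long.parseLong from overflowing.
            if (!isAllDigits(first) || first.length() > 18) {
                continue;
            }
            long value = Long.parseLong(first);
            if (value >= min && value <= max) {
                kept.add(line);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "59001000,FOO,6/29/08,22,23,BAR",
            "59001001,FOO,6/29/08,23,24,BAR",
            "oops,FOO,6/29/08,23,24,BAR"); // hypothetical bad row, not in t.csv
        for (String row : filterByFirstColumn(rows, 59001001L, Long.MAX_VALUE)) {
            System.out.println(row);
        }
    }
}
```

With the sample rows above and a lower bound of 59001001, only the second row from t.csv is kept: the first row falls below the bound and the hypothetical "oops" row fails the digits check.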