RE: Filtering for a numeric value
You should probably build pig.jar from trunk to use types. The
released version of Pig does not support them. Look out for a release
with types soon.

Santhosh

-----Original Message-----
From: Gregory Harman [mailto:[EMAIL PROTECTED]]
Sent: Saturday, February 07, 2009 12:53 PM
To: [EMAIL PROTECTED]
Subject: Filtering for a numeric value

Hi all,

I'm just getting started with Pig, and am having a problem that I'm  
hoping is some standard rookie mistake. I have the following data in a  
file "t.csv":

59001000,FOO,6/29/08,22,23,BAR
59001001,FOO,6/29/08,23,24,BAR

I want to read in this file, ensuring that the first column is numeric  
(so that it won't break a filter on that column with a numeric  
operator such as > later), so I first tried to cast it in a schema  
like so:

raw = LOAD 't.csv' USING PigStorage (',') as (a:long, b:chararray,  
c:chararray, d:int, e:int, f:chararray);

but I get the following stack trace:

2009-02-07 15:45:47,011 [main] ERROR org.apache.pig.tools.grunt.Grunt - java.io.IOException: Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

at org.apache.pig.PigServer.registerQuery(PigServer.java:278)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:475)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:81)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:294)
Caused by: org.apache.pig.impl.logicalLayer.parser.ParseException: Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

at org.apache.pig.impl.logicalLayer.parser.QueryParser.generateParseException(QueryParser.java:4885)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_consume_token(QueryParser.java:4762)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.SchemaTuple(QueryParser.java:2567)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:660)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:512)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:362)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:47)
at org.apache.pig.PigServer.registerQuery(PigServer.java:275)
... 5 more

2009-02-07 15:45:47,012 [main] ERROR org.apache.pig.tools.grunt.Grunt - Encountered ": long" at line 1, column 48.
Was expecting one of:
     "," ...
     ")" ...
     ":" "(" ...
     ":" "[" ...

I next tried loading it in with no type casting and instead filtering  
based on a numeric-only regexp:

raw = LOAD 't.csv' USING PigStorage (',') as (a, b, c, d, e, f);
clean = FILTER raw BY ($0 matches '\d+'); /* \d+ should be Java regexp  
for one or more digits */

but this resulted in:

org.apache.pig.impl.logicalLayer.parser.TokenMgrError: Lexical error at line 1, column 37.  Encountered: "d" (100), after : "\'\\"
at org.apache.pig.impl.logicalLayer.parser.QueryParserTokenManager.getNextToken(QueryParserTokenManager.java:1623)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_scan_token(QueryParser.java:4771)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_91(QueryParser.java:4439)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_72(QueryParser.java:4492)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_54(QueryParser.java:4536)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_39(QueryParser.java:4590)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3R_27(QueryParser.java:4606)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_3_4(QueryParser.java:4369)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.jj_2_4(QueryParser.java:3820)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PUnaryCond(QueryParser.java:1105)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PAndCond(QueryParser.java:1055)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.POrCond(QueryParser.java:1005)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.PCond(QueryParser.java:973)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.FilterClause(QueryParser.java:941)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.BaseExpr(QueryParser.java:686)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Expr(QueryParser.java:512)
at org.apache.pig.impl.logicalLayer.parser.QueryParser.Parse(QueryParser.java:362)
at org.apache.pig.impl.logicalLayer.LogicalPlanBuilder.parse(LogicalPlanBuilder.java:47)
at org.apache.pig.PigServer.registerQuery(PigServer.java:275)
at org.apache.pig.tools.grunt.GruntParser.processPig(GruntParser.java:475)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:233)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:81)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:60)
at org.apache.pig.Main.main(Main.java:294)

I'm using Pig 0.1.1 in local mode.

I'm interested in knowing if there's a better way to accomplish my  
goal (filter on a value range for the first column), but I'd also like  
to know why these two breakages happened, as both should be valid as  
far as I understand the docs...

thanks,
Greg
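
For anyone landing on this thread later, here is a minimal Pig Latin sketch of the two routes discussed above. The typed statements assume a Pig build with type support (trunk at the time of the reply), since the 0.1.1 parser rejects ": long". The range bounds and the aliases in_range, untyped and digits_only are illustrative and do not come from the original messages. The '[0-9]+' pattern is also an assumption: it avoids the backslash that the lexer stopped on, but it has not been checked against 0.1.1.

```pig
-- Route 1: typed schema. Assumes a types-enabled Pig build (trunk at the
-- time of this thread); Pig 0.1.1 fails to parse ": long".
raw = LOAD 't.csv' USING PigStorage(',')
      AS (a:long, b:chararray, c:chararray, d:int, e:int, f:chararray);

-- Range filter on the first column. The bounds are illustrative only.
in_range = FILTER raw BY a >= 59001000L AND a <= 59001001L;
DUMP in_range;

-- Route 2: untyped load plus a digits-only regexp. The character class
-- [0-9] stands in for \d so the quoted pattern contains no backslash.
-- This is an untested assumption for 0.1.1.
untyped = LOAD 't.csv' USING PigStorage(',') AS (a, b, c, d, e, f);
digits_only = FILTER untyped BY ($0 matches '[0-9]+');
DUMP digits_only;
```

On a typed build, a first-column value that cannot be cast to long should load as null and fall out of the comparison, so the regexp route is probably only worth trying on an untyped load.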