Pig >> mail # user >> Re: strange gzip-related error


Joe Crobak 2012-04-09, 19:41
Re: strange gzip-related error
So in this case, it seems like JsonStringToMap is properly catching the
parse exception; in fact, it's the catch clause of the UDF that's
generating the "Could not json-decode string" message in your task tracker
logs.

Take a look at line 63 here:
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java

When a parse exception happens, the UDF returns a null.  Are you filtering
out nulls before trying to project?

Norbert
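
A minimal sketch of the null filtering suggested above, assuming X is the relation from the original session with fields (line, json). The relation names parsed, unparsed, and sample_bad are made up for illustration. It only drops rows where the UDF returned null; it does not by itself explain the java.lang.Long cast error.

```pig
-- Keep only the rows where JsonStringToMap produced a map.
parsed = FILTER X BY json IS NOT NULL;

-- Project from the map only after the nulls are gone.
A = FILTER parsed BY
       json#'logtype' == 'foo'
    OR json#'consumer' == 'foo1';
B = LIMIT A 2;
DUMP B;

-- Optional: look at a few raw lines the UDF could not decode.
unparsed = FILTER X BY json IS NULL;
sample_bad = LIMIT unparsed 5;
DUMP sample_bad;
```

Dumping a handful of the rejected lines is a cheap way to see whether they are truncated JSON or something that looks like binary data.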

On Mon, Apr 9, 2012 at 3:41 PM, Joe Crobak <[EMAIL PROTECTED]> wrote:

> So it turns out our uncompressed data contains corrupted rows. Is there an
> easy way to wrap the JsonStringToMap UDF so that it catches exceptions on
> unparsable lines and just skips them?
>
> On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I'm using pig 0.9.2 on cdh3u3 with a snapshot-build of elephant bird in
> > order to get json parsing. I have an incredibly unusual error that I see
> > with certain gzip compressed files. It's probably easiest to show you a
> > pig session:
> >
> > grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
> > grunt> register '/home/joe/json-simple-1.1.jar';
> > grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
> > TextLoader() as (line: chararray);
> > grunt> X = FOREACH apiHits GENERATE line,
> > com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
> > grunt> Y = LIMIT X 2;
> > grunt> dump Y;
> > (succeeds, and I get what I expect).
> >
> > Now, if I try to do a projection using the json field, I get the
> > following:
> >
> > grunt> A = FILTER X BY
> > >>   json#'logtype' == 'foo'
> > >>   OR json#'consumer' == 'foo1'
> > >>   OR json#'consumer' == 'foo2'
> > >>   OR json#'consumer' == 'foo3'
> > >>   OR json#'consumer' == 'foo4'
> > >>   ;
> > grunt> B = LIMIT A 2;
> > grunt> dump B;
> >
> > ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR: java.lang.Long cannot be cast to org.json.simple.JSONObject
> >
> > And in the task tracker logs, the stack trace suggests that the json udf
> > is seeing compressed data [1]. Does anyone have any ideas how to debug
> > this, or guesses as to what the problem is? Can I somehow determine if
> > hadoop is actually decompressing the data or not?
> >
> > Thanks!
> > Joe
> >
> > [1]
> >
> > 2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string:  � ���
> > Unexpected character ( ) at position 0.
> >       at org.json.simple.parser.Yylex.yylex(Unknown Source)
> >       at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)
Joe Crobak 2012-04-09, 21:27
Dmitriy Ryaboy 2012-04-09, 21:33