Re: strange gzip-related error
So in this case, it seems like JsonStringToMap is properly catching the
parse exception; in fact, it's the catch clause of the UDF that's
generating the "Could not json-decode string" message in your task tracker
logs.

Take a look at line 63 here:
https://github.com/kevinweil/elephant-bird/blob/master/src/java/com/twitter/elephantbird/pig/piggybank/JsonStringToMap.java

When a parse exception happens, the UDF returns a null.  Are you filtering
out nulls before trying to project?
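
For example, here's a minimal sketch (untested, reusing the relation names from your session) that drops the null rows before the projection:

grunt> X_clean = FILTER X BY json IS NOT NULL;
grunt> A = FILTER X_clean BY
>>   json#'logtype' == 'foo'
>>   OR json#'consumer' == 'foo1'
>>   ;
grunt> B = LIMIT A 2;
grunt> dump B;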

Norbert

On Mon, Apr 9, 2012 at 3:41 PM, Joe Crobak <[EMAIL PROTECTED]> wrote:

> So it turns out our uncompressed data contains corrupted rows. Is there
> an easy way to wrap the JsonStringToMap UDF to catch exceptions on
> unparsable lines and just skip them?
>
> On Thu, Apr 5, 2012 at 11:44 AM, Joe Crobak <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > I'm using Pig 0.9.2 on cdh3u3 with a snapshot build of Elephant Bird
> > in order to get JSON parsing. I'm seeing a very unusual error with
> > certain gzip-compressed files. It's probably easiest to show you a Pig
> > session:
> >
> > grunt> register '/home/joe/elephant-bird-2.1.12-SNAPSHOT.jar';
> > grunt> register '/home/joe/json-simple-1.1.jar';
> > grunt> apiHits = LOAD '/user/joe/path/to/part-r-00000.gz' USING
> > TextLoader() as (line: chararray);
> > grunt> X = FOREACH apiHits GENERATE line,
> > com.twitter.elephantbird.pig.piggybank.JsonStringToMap(line) as json;
> > grunt> Y = LIMIT X 2;
> > grunt> dump Y;
> > (succeeds, and I get what I expect).
> >
> > Now, if I try to do a projection using the json field, I get the
> > following:
> >
> > grunt> A = FILTER X BY
> > >>   json#'logtype' == 'foo'
> > >>   OR json#'consumer' == 'foo1'
> > >>   OR json#'consumer' == 'foo2'
> > >>   OR json#'consumer' == 'foo3'
> > >>   OR json#'consumer' == 'foo4'
> > >>   ;
> > grunt> B = LIMIT A 2;
> > grunt> dump B;
> >
> > ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR:
> > java.lang.Long cannot be cast to org.json.simple.JSONObject
> >
> > And in the task tracker logs, the stack trace suggests that the JSON
> > UDF is seeing compressed data [1]. Does anyone have ideas on how to
> > debug this, or guesses as to what the problem is? Can I somehow
> > determine whether Hadoop is actually decompressing the data or not?
> >
> > Thanks!
> > Joe
> >
> > [1]
> >
> > 2012-04-05 14:39:20,211 WARN com.twitter.elephantbird.pig.piggybank.JsonStringToMap: Could not json-decode string:  � ���
> > Unexpected character ( ) at position 0.
> >       at org.json.simple.parser.Yylex.yylex(Unknown Source)
> >       at org.json.simple.parser.JSONParser.nextToken(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at org.json.simple.parser.JSONParser.parse(Unknown Source)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.parseStringToMap(JsonStringToMap.java:63)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:53)
> >       at com.twitter.elephantbird.pig.piggybank.JsonStringToMap.exec(JsonStringToMap.java:25)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:216)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:299)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:332)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:332)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:284)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:290)
> >       at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POFilter.getNext(POFilter.java:95)