Pig >> mail # user >> Problem loading sequence files with Elephant Bird


Re: Problem loading sequence files with Elephant Bird
'AS' is almost always dangerous. The loader already has a schema. Use a
projection if you want to rename the fields.
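
A minimal sketch of that suggestion, reusing the parameters declared further
down the thread. It assumes the loader reports a two-field schema with the
key first and the value second; the alias json_line is hypothetical and only
there for illustration. Check the DESCRIBE output before relying on it.

```pig
-- Sketch only: load without an AS clause so the loader's own schema is used.
raw_logs = LOAD '$INPUT_LOCATION'
    USING $SEQFILE_LOADER ('-c $NULL_CONVERTER', '-c $TEXT_CONVERTER');

-- Inspect the schema the loader reports (field names and types).
DESCRIBE raw_logs;

-- Rename through a projection instead of AS on the LOAD.
-- $1 is the second field (assumed to be the value); json_line is a hypothetical alias.
json_logs = FOREACH raw_logs GENERATE $1 AS json_line;
```

Positional references ($0, $1) avoid depending on whatever field names the
loader assigns.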

On Fri, May 18, 2012 at 4:07 PM, Chris Diehl <[EMAIL PROTECTED]> wrote:

> With a little bit of luck, we managed to find an answer.
>
> Turns out we needed to remove the cast from key and run the script in Pig
> 0.10. I was running the script with Pig 0.8.1 up until today.
>
> raw_logs = LOAD '$INPUT_LOCATION' USING $SEQFILE_LOADER ('-c $NULL_CONVERTER','-c $TEXT_CONVERTER')
>     AS (key, value: chararray);
>
> Chris
>
> On Fri, May 18, 2012 at 2:27 PM, Chris Diehl <[EMAIL PROTECTED]> wrote:
>
> > Hi Andy,
> >
> > Here's what is in the log file.
> >
> > Pig Stack Trace
> > ---------------
> > ERROR 2244: Job failed, hadoop does not return any error message
> >
> > org.apache.pig.backend.executionengine.ExecException: ERROR 2244: Job failed, hadoop does not return any error message
> > at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:119)
> > at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:172)
> > at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
> > at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
> > at org.apache.pig.Main.run(Main.java:500)
> > at org.apache.pig.Main.main(Main.java:107)
> >
> > ===============================================================================
> >
> > I am running it on the cluster. I could not find any additional
> > information on the job tracker.
> >
> > The keys in the sequence files are all null. The values are all JSON
> > strings. Given that information, I tried configuring the SequenceFileLoader
> > this way, to no avail.
> >
> > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> > %declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter'
> >
> > raw_logs = LOAD '$INPUT_LOCATION' USING $SEQFILE_LOADER ('-c $NULL_CONVERTER','-c $TEXT_CONVERTER')
> >     AS (key: chararray, value: chararray);
> >
> > Is there another way I should be configuring it?
> >
> > Chris
> >
> > On Fri, May 18, 2012 at 11:24 AM, Andy Schlaikjer <[EMAIL PROTECTED]> wrote:
> >
> >> Chris, the console output mentions file
> >> "/opt/shared_storage/log_analysis_pig_python_scripts/pig_1337299061301.log".
> >> Does this contain any kind of stack trace? Were you running the script in
> >> local mode or on a cluster? If the latter, there should be at least map
> >> task log output someplace that may also have some clues.
> >>
> >> Does path
> >> '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> contain SequenceFile<Text, Text> data? If not, you'll have to configure
> >> SequenceFileLoader further to properly deserialize the key-value pairs.
> >>
> >> Andy
> >>
> >>
> >> On Thu, May 17, 2012 at 5:07 PM, Chris Diehl <[EMAIL PROTECTED]> wrote:
> >>
> >> > Andy,
> >> >
> >> > Here's what I'm seeing when I run the following script. There's no
> >> > information beyond what is here in the log file.
> >> >
> >> > Chris
> >> >
> >> > REGISTER '/opt/shared_storage/elephant-bird/build/elephant-bird-2.2.3-SNAPSHOT.jar';
> >> > %declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';
> >> > %declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';
> >> > %declare NULL_CONVERTER 'com.twitter.elephantbird.pig.util.NullWritableConverter'
> >> >
> >> > rmf /data/SearchLogJSON;
> >> >
> >> > -- Load raw log data
> >> > raw_logs = LOAD '/logs/jive/internal/raw/2012/05/07/2012050795652.0627-720078349.seq'
> >> >     USING $SEQFILE_LOADER ();
> >> >
> >> > -- Store the JSON
> >> > STORE raw_logs INTO '/data/SearchLogJSON/';
> >> >
> >> > -------------------
> >> >
> >> > -sh-3.2$ pig dump_log_json.pig
> >> > 2012-05-17 23:57:41,304 [main] INFO  org.apache.pig.Main - Logging