Re: runtime exception when loading and storing multiple files using Avro in Pig
Hi Dan,

Glad to hear that it worked. I totally agree that AvroStorage can be
improved. In fact, it was written for Pig 0.7, so it could be rewritten much
more cleanly now.

The only concern that I have is backward compatibility. That is, if I change
the syntax (which I badly wanted to do while working on AvroStorage recently),
it will break existing scripts. What I have been thinking is to
rewrite AvroStorage in core Pig, like HBaseStorage. For
backward compatibility, we could keep the old version in Piggybank for a
while and eventually retire it.
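
To make the trade-off concrete, the user-facing difference could look
something like the sketch below. The short built-in name and the dropped
REGISTER are just my assumptions about a core version, not an existing API:

-- Piggybank version today: register the jar and spell out the full class name
REGISTER /path/to/piggybank.jar;
set1 = load 'input1.txt' using PigStorage('|') as (id:long, f1:long);
store set1 into 'set1'
    using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

-- Hypothetical core version: ships with Pig, so no REGISTER and a short name
store set1 into 'set1' using AvroStorage();

Old scripts that reference the Piggybank class name would keep working for as
long as the old version stays in Piggybank.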

I am wondering what other people think. Please let me know if it is not a
good idea to move AvroStorage to core Pig from Piggybank.

Thanks,
Cheolsoo

On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <[EMAIL PROTECTED]> wrote:

> Thanks, Cheolsoo. That solved my problem.
>
> It would be nice if Pig could do this automatically when there are multiple
> AvroStorage stores in the script. Otherwise, we have to track the index
> numbers manually.
>
> Dan
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, August 21, 2012 5:06 PM
> To: [EMAIL PROTECTED]
> Subject: Re: runtime exception when loading and storing multiple files
> using Avro in Pig
>
> Hi Danfeng,
>
> The "long" is from the 1st AvroStorage store in your script. The
> AvroStorage has very funny syntax regarding multiple stores. To apply
> different avro schemas to multiple stores, you have to specify their
> "index" as follows:
>
> set1 = load 'input1.txt' using PigStorage('|') as ( ... );
> store set1 into 'set1'
>     using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
>
> set2 = load 'input2.txt' using PigStorage('|') as ( ... );
> store set2 into 'set2'
>     using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
>
> As you can see, I added the 'index' parameter to each store.
>
> What AvroStorage does is construct the following string in the frontend:
>
> "1#<1st avro schema>,2#<2nd avro schema>"
>
> and pass it to the backend via UDFContext. In the backend, tasks then parse
> this string to get the output schema for each store.
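>
> For example, with the two stores above, the constructed string would look
> something like this (the record schemas are made up and abbreviated here,
> just for illustration):
>
> 1#{"type":"record","name":"set1","fields":[...]},2#{"type":"record","name":"set2","fields":[...]}
>
> Each task looks up the schema for its own store by index. That is also why,
> without the 'index' parameters, your second store picked up the schema of
> the first one.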
>
> Thanks,
> Cheolsoo
>
> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <[EMAIL PROTECTED]>
> wrote:
>
> > I ran into this strange problem when trying to load multiple
> > text-formatted files and convert them into Avro format using Pig. However,
> > if I read and convert one file at a time in separate runs, everything
> > is fine. The error message is the following:
> >
> > 2012-08-21 19:15:32,964 [main] ERROR org.apache.pig.tools.grunt.GruntParser
> > - ERROR 2997: Unable to recreate exception from backed error:
> > org.apache.avro.file.DataFileWriter$AppendWriteException:
> > java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
> > union ["null","long"]
> >     at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
> >     at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
> >     at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> >     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
> >     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java:48)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapB
> >
> > My code is:
> > set1 = load '$input_dir/set1.txt' using PigStorage('|') as (
> >    id:long,
> >    f1:long,
> >    f2:chararray,