Pig, mail # user - runtime exception when load and store multiple files using avro in pig


Thread:
  Danfeng Li     2012-08-21, 23:38
  Cheolsoo Park  2012-08-22, 00:06
  Danfeng Li     2012-08-22, 00:47
  Cheolsoo Park  2012-08-22, 01:03
  Alan Gates     2012-08-22, 01:26

RE: runtime exception when load and store multiple files using avro in pig
Danfeng Li 2012-08-22, 05:43
Hi Cheolsoo,

If we could allow a string as the index, it would remain backward compatible and would also give us the ability to separate schemas without needing to track the numbers.

Thanks.
Dan

-----Original Message-----
From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, August 21, 2012 6:04 PM
To: [EMAIL PROTECTED]
Subject: Re: runtime exception when load and store multiple files using avro in pig

Hi Dan,

Glad to hear that it worked. I totally agree that AvroStorage can be improved. In fact, it was written for Pig 0.7, so it could be rewritten much more cleanly now.

The only concern I have is backward compatibility. That is, if I change the syntax (which I badly wanted to do while working on AvroStorage recently), it will break backward compatibility. What I have been thinking is to rewrite AvroStorage in core Pig, like HBaseStorage. For backward compatibility, we could keep the old version in Piggybank for a while and eventually retire it.

I am wondering what other people think. Please let me know if it is not a good idea to move AvroStorage to core Pig from Piggybank.

Thanks,
Cheolsoo

On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <[EMAIL PROTECTED]> wrote:

> Thanks, Cheolsoo. That solved my problem.
>
> It would be nice if Pig could do this automatically when there are
> multiple AvroStorage stores in the script. Otherwise, we have to track the numbers manually.
>
> Dan
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, August 21, 2012 5:06 PM
> To: [EMAIL PROTECTED]
> Subject: Re: runtime exception when load and store multiple files
> using avro in pig
>
> Hi Danfeng,
>
> The "long" comes from the first AvroStorage store in your script.
> AvroStorage has rather unusual syntax for multiple stores: to apply
> different Avro schemas to multiple stores, you have to specify their
> "index" as follows:
>
> set1 = load 'input1.txt' using PigStorage('|') as ( ... );
> store set1 into 'set1' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
>
> set2 = load 'input2.txt' using PigStorage('|') as ( .. );
> store set2 into 'set2' using org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
>
> As can be seen, I added the 'index' parameters.
>
> What AvroStorage does is construct the following string in the frontend:
>
> "1#<1st avro schema>,2#<2nd avro schema>"
>
> and pass it to the backend via UdfContext. In the backend, tasks parse
> this string to get the output schema for each store.
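[Editor's aside] The index-keyed encoding described above can be sketched in a few lines of Python. This is an illustrative sketch only, not AvroStorage's actual implementation: the helper names are made up, and it assumes the schema strings contain no commas (real Avro schemas are JSON and may contain them, so the real code carries the mapping differently).

```python
# Illustrative sketch: key each store's schema by its 'index' parameter,
# serialize the map to one string on the frontend, and parse it back in
# each backend task. Assumes comma-free schema strings (a simplification).

def encode_schemas(schemas_by_index):
    # e.g. {1: '"long"', 2: '"string"'} -> '1#"long",2#"string"'
    return ",".join(f"{i}#{s}" for i, s in sorted(schemas_by_index.items()))

def decode_schemas(encoded):
    # Each task looks up its own store's schema by index.
    out = {}
    for part in encoded.split(","):
        idx, schema = part.split("#", 1)
        out[int(idx)] = schema
    return out
```

With this shape, a task handling store 2 would call `decode_schemas(...)` on the string it received and read entry 2, which is why each store in the script needs a distinct index.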
>
> Thanks,
> Cheolsoo
>
> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <[EMAIL PROTECTED]>
> wrote:
>
> > I ran into this strange problem when trying to load multiple
> > text-formatted files and convert them into Avro format using Pig.
> > However, if I read and convert one file at a time in separate runs,
> > everything is fine. The error message follows:
> >
> > 2012-08-21 19:15:32,964 [main] ERROR org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to recreate exception from backed error:
> > org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in union ["null","long"]
> >     at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
> >     at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
> >     at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
> >     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
> >     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
> >     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutp
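[Editor's aside] The failure in the trace above is ordinary Avro union resolution: the writer picks a union branch based on the datum's runtime type, and a timestamp string like "1980-01-01 00:00:00.000" matches neither "null" nor "long" — here, because without distinct 'index' parameters the first store's schema was applied to the second store's data. A minimal Python sketch of that type check (the function is hypothetical, not Avro's real API, which does this inside DataFileWriter.append):

```python
# Hypothetical sketch of Avro-style union branch resolution for the
# union ["null","long"]. Real Avro performs this check in the datum
# writer during DataFileWriter.append.
def resolve_union_branch(datum, union=("null", "long")):
    # None maps to the "null" branch.
    if datum is None and "null" in union:
        return "null"
    # Python ints (but not bools) map to the "long" branch.
    if isinstance(datum, int) and not isinstance(datum, bool) and "long" in union:
        return "long"
    # A string such as "1980-01-01 00:00:00.000" matches no branch.
    raise RuntimeError(f"Datum {datum} is not in union {list(union)}")
```

This is why the thread's fix works: giving each store its own 'index' makes each store write with the schema that actually matches its data's types.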