Pig, mail # user - runtime exception when load and store multiple files using avro in pig


Danfeng Li 2012-08-21, 23:38
Cheolsoo Park 2012-08-22, 00:06
Danfeng Li 2012-08-22, 00:47
Cheolsoo Park 2012-08-22, 01:03

Re: runtime exception when load and store multiple files using avro in pig
Alan Gates 2012-08-22, 01:26
Moving it into core makes sense to me, as Avro is a format we should be supporting.

Alan.

On Aug 21, 2012, at 6:03 PM, Cheolsoo Park wrote:

> Hi Dan,
>
> Glad to hear that it worked. I totally agree that AvroStorage can be
> improved. In fact, it was written for Pig 0.7, so it could be written much
> more cleanly now.
>
> The only concern I have is backward compatibility. That is, if I change the
> syntax (which I wanted to do badly while working on AvroStorage recently),
> it will break backward compatibility. What I have been thinking is to
> rewrite AvroStorage in core Pig, like HBaseStorage. For
> backward compatibility, we may keep the old version in Piggybank for a
> while and eventually retire it.
>
> I am wondering what other people think. Please let me know if it is not a
> good idea to move AvroStorage to core Pig from Piggybank.
>
> Thanks,
> Cheolsoo
>
> On Tue, Aug 21, 2012 at 5:47 PM, Danfeng Li <[EMAIL PROTECTED]> wrote:
>
>> Thanks, Cheolsoo. That solves my problem.
>>
>> It would be nice if Pig could do this automatically when there are multiple
>> AvroStorage stores in the script. Otherwise, we have to track the index
>> numbers manually.
>>
>> Dan
>>
>> -----Original Message-----
>> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
>> Sent: Tuesday, August 21, 2012 5:06 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: runtime exception when load and store multiple files using
>> avro in pig
>>
>> Hi Danfeng,
>>
>> The "long" is from the 1st AvroStorage store in your script. The
>> AvroStorage has very funny syntax regarding multiple stores. To apply
>> different avro schemas to multiple stores, you have to specify their
>> "index" as follows:
>>
>> set1 = load 'input1.txt' using PigStorage('|') as ( ... );
>> store set1 into 'set1' using
>> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');
>>
>> set2 = load 'input2.txt' using PigStorage('|') as ( ... );
>> store set2 into 'set2' using
>> org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');
>>
>> As can be seen, I added the 'index' parameters.
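For concreteness, a minimal sketch of the fixed script might look like the
following. The column names and types are hypothetical and the piggybank jar
path is environment-specific; only the 'index' arguments come from this thread:

  register /path/to/piggybank.jar;  -- hypothetical path to the Piggybank jar

  set1 = load 'input1.txt' using PigStorage('|') as (id:long, name:chararray);
  store set1 into 'set1' using
      org.apache.pig.piggybank.storage.avro.AvroStorage('index', '1');

  set2 = load 'input2.txt' using PigStorage('|') as (id:long, ts:chararray);
  store set2 into 'set2' using
      org.apache.pig.piggybank.storage.avro.AvroStorage('index', '2');

  -- With distinct indexes each store gets its own Avro schema, so set2's
  -- chararray ts is no longer validated against set1's long field.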
>>
>> What AvroStorage does is to construct the following string in the frontend:
>>
>> "1#<1st avro schema>,2#<2nd avro schema>"
>>
>> and pass it to the backend via UDFContext. In the backend, tasks parse this
>> string to get the output schema for each store.
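With the hypothetical schemas from the sketch above, that string would look
roughly like this (abbreviated):

  1#{"type":"record","name":"set1","fields":[...]},2#{"type":"record","name":"set2","fields":[...]}

Each backend task then picks out the entry whose index matches its own store,
which is presumably why, in the failing script, the second store was written
with the first store's schema.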
>>
>> Thanks,
>> Cheolsoo
>>
>> On Tue, Aug 21, 2012 at 4:38 PM, Danfeng Li <[EMAIL PROTECTED]>
>> wrote:
>>
>>> I ran into this strange problem when trying to load multiple
>>> text-formatted files and convert them into Avro format using Pig. However,
>>> if I read and convert one file at a time in separate runs, everything
>>> is fine. The error message is the following:
>>>
>>> 2012-08-21 19:15:32,964 [main] ERROR
>>> org.apache.pig.tools.grunt.GruntParser - ERROR 2997: Unable to
>>> recreate exception from backed error:
>>> org.apache.avro.file.DataFileWriter$AppendWriteException:
>>> java.lang.RuntimeException: Datum 1980-01-01 00:00:00.000 is not in
>>> union ["null","long"]
>>>     at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:263)
>>>     at org.apache.pig.piggybank.storage.avro.PigAvroRecordWriter.write(PigAvroRecordWriter.java:49)
>>>     at org.apache.pig.piggybank.storage.avro.AvroStorage.putNext(AvroStorage.java:612)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:139)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat$PigRecordWriter.write(PigOutputFormat.java:98)
>>>     at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.write(MapTask.java:531)
>>>     at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
>>>     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapOnly$Map.collect(PigMapOnly.java
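Reduced to a sketch, the failing pattern is simply two AvroStorage stores with
no 'index' argument (column names hypothetical):

  set1 = load 'input1.txt' using PigStorage('|') as (id:long);
  store set1 into 'set1' using
      org.apache.pig.piggybank.storage.avro.AvroStorage();

  set2 = load 'input2.txt' using PigStorage('|') as (ts:chararray);
  store set2 into 'set2' using
      org.apache.pig.piggybank.storage.avro.AvroStorage();

  -- Presumably both stores resolve to the same stored schema, so set2's
  -- timestamp string is checked against set1's ["null","long"] union and
  -- the append fails with the AppendWriteException above.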
Danfeng Li 2012-08-22, 05:43