Pig, mail # user - AvroStorage Default values are set to null even if they are specified


Viraj Bhat 2013-05-03, 03:34
Re: AvroStorage Default values are set to null even if they are specified
Cheolsoo Park 2013-05-03, 05:00
Hi Viraj,

Yes, that's a known bug. Here is what happens:

1) Let's say there are two schemas, X and Y.
2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
3) Fields are filled in as values are read. If no value is found for a
field, it is left as null rather than set to the schema's default.
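
The three steps above can be sketched in a few lines of plain Python. This is only an illustration of the described behavior, not the actual PigAvroRecordReader code: the function name read_record is hypothetical, and the merged field order is copied from the describe output quoted later in this thread.

```python
# Illustrative only: plain Python, not the actual PigAvroRecordReader logic.
# Field order is taken from the `describe` output quoted in this thread.
MERGED_FIELDS = ["name", "age", "dept", "lastname", "salary", "office"]


def read_record(record, merged_fields=MERGED_FIELDS):
    """Build one output tuple; positions without a value stay None (null)."""
    # Step 2: allocate a tuple large enough for the merged schema.
    proto = [None] * len(merged_fields)
    # Step 3: fill in only the values that were actually read.
    for i, field in enumerate(merged_fields):
        if field in record:
            proto[i] = record[field]
    return tuple(proto)


# A record written with the three-field schema (name, age, dept).
print(read_record({"name": "Milo", "age": 30, "dept": "DH"}))
# prints ('Milo', 30, 'DH', None, None, None)
```

The trailing None values correspond to the empty columns in rows such as (Milo,30,DH,,,).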

If you'd like to fix it, please take a look at PigAvroRecordReader.java:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

In particular, see how mProtoTuple is initialized and updated.
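
One possible direction for a fix, sketched in plain Python rather than in the reader's Java: seed the prototype tuple with the schema defaults instead of nulls, then let any value that is read override the default. The names new_proto_tuple and read_record are hypothetical, the field list is hand-written from the describe output and the Avro schemas quoted in this thread, and the real change in PigAvroRecordReader may need to look quite different.

```python
# Illustrative only: a sketch of "initialize from defaults", not a patch.
# Order comes from the `describe` output in this thread; defaults come from
# the Avro schemas quoted in it.
MERGED_FIELDS = [
    ("name", "NU"), ("age", 0), ("dept", "DU"),
    ("lastname", "LNU"), ("salary", 0), ("office", "OU"),
]


def new_proto_tuple(fields):
    """Start from the schema defaults instead of a tuple of None."""
    return [default for _, default in fields]


def read_record(record, fields=MERGED_FIELDS):
    """Build one output tuple; missing fields keep their schema default."""
    proto = new_proto_tuple(fields)
    for i, (name, _) in enumerate(fields):
        if name in record:
            proto[i] = record[name]  # a value that was read overrides the default
    return tuple(proto)


# A record written with the four-field schema (name, age, dept, office).
print(read_record({"name": "Praj", "age": 54, "dept": "RMX", "office": "Champaign"}))
# prints ('Praj', 54, 'RMX', 'LNU', 0, 'Champaign')
```

A fresh list is built for every record so that values from one record cannot leak into the next.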

Thanks,
Cheolsoo

On Thu, May 2, 2013 at 8:34 PM, Viraj Bhat <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo/Pig User Group,
>   I am using the Pig 0.11 piggybank AvroStorage. When merging multiple
> schemas in which default values have been specified in the Avro schema,
> AvroStorage puts nulls in the merged data set.
> Is this a known bug in the current implementation of AvroStorage?
> Here is an example provided by one of my colleagues. The final dataset should
> contain "LNU", 0, "OU" for all values where the columns do not exist.
> ==> Employee3.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0 },
>         {"name" : "dept", "type": "string", "default" : "DU"}
> ]
> }
>
> ==> Employee4.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0},
>         {"name" : "dept", "type": "string", "default" : "DU"},
>         {"name" : "office", "type": "string", "default" : "OU"}
> ]
> }
>
> ==> Employee6.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "lastname", "type": "string", "default" : "LNU"},
>         {"name" : "age", "type" : "int","default" : 0},
>         {"name" : "salary", "type": "int", "default" : 0},
>         {"name" : "dept", "type": "string","default" : "DU"},
>         {"name" : "office", "type": "string","default" : "OU"}
> ]
> }
>
> The pig script:
> employee = load '$input' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
> describe employee;
> dump employee;
>
> The call:
> dump_employees.pig employee{3,4,6}.ser
>
> The output:
> employee: {name: chararray,age: int,dept: chararray,lastname:
> chararray,salary: int,office: chararray}
>
> (Milo,30,DH,,,)
> (Asmya,34,PQ,,,)
> (Baljit,23,RS,,,)
> (Pune,60,Astrophysics,Warriors,5466,UTA)
> (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
> (Chennai,50,Microbiology,Superkings,7338,Hopkins)
> (Mumbai,20,Applied Math,Indians,4468,UAH)
> (Praj,54,RMX,,,Champaign)
> (Buba,767,HD,,,Sunnyvale)
> (Manku,375,MS,,,New York)
> Regards
> Viraj
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 30, 2013 9:10 PM
> To: [EMAIL PROTECTED]
> Cc: Qi, Runping
> Subject: Re: Override input schema in AvroStorage
>
> Hi Steven,
>
> The new AvroStorage will let you specify the input schema:
> https://issues.apache.org/jira/browse/PIG-3015
>
> In fact, somebody made the same request in a comment on the JIRA, which I am
> copying and pasting below:
>
> > Furthermore, we occasionally have issues with pig jobs picking the old
> > schema when we have a schema update. Manually specifying the schema
> > would fix this and give us more flexibility in defining the data we
> > want pig to pull from a file.
>
>
> This JIRA is a work in progress, but hopefully it will be in the next major
> release.
>
> Thanks,
> Cheolsoo
>
>
>
> On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <[EMAIL PROTECTED]> wrote:
>
> > Resending now that I am subscribed :)
> >
> > On 4/25/13 4:01 PM, "Enns, Steven" <[EMAIL PROTECTED]> wrote:
> >
> > >Hi everyone,
> > >
> > >I would like to override the input schema in AvroStorage to make a