Pig >> mail # user >> AvroStorage Default values are set to null even if they are specified


Viraj Bhat 2013-05-03, 03:34
Re: AvroStorage Default values are set to null even if they are specified
Hi Viraj,

Yes, that's a known bug. Here is what happens:

1) Let's say there are two schemas, X and Y.
2) AvroStorage creates a tuple whose size == max( sizeOf(X), sizeOf(Y) ).
3) Fields are filled in as values are read. But if a record has no value
for a field, that field is left as null; the schema default is not applied.

If you'd like to fix it, please take a look at PigAvroRecordReader.java:
http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/PigAvroRecordReader.java

In particular, see how mProtoTuple is initialized and updated.
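
To make the three steps above concrete, here is a minimal sketch in Python. It is not the actual PigAvroRecordReader code: the names merge_fields and read_record are hypothetical, and the schemas are reduced to (field, default) pairs taken from the Employee3 and Employee6 examples quoted below. It only illustrates the difference between a prototype tuple initialized with nulls and one initialized with the schema defaults.

```python
# Hypothetical sketch only; this does not mirror PigAvroRecordReader's real code.
# Each schema is reduced to (field name, default) pairs from the quoted examples.
EMPLOYEE3 = [("name", "NU"), ("age", 0), ("dept", "DU")]
EMPLOYEE6 = [
    ("name", "NU"), ("lastname", "LNU"), ("age", 0),
    ("salary", 0), ("dept", "DU"), ("office", "OU"),
]


def merge_fields(*schemas):
    """Union of all fields in first-seen order, mapped to their defaults."""
    merged = {}
    for schema in schemas:
        for name, default in schema:
            merged.setdefault(name, default)
    return merged


def read_record(record, merged, use_defaults):
    """Build one output tuple sized to the merged schema.

    use_defaults=False reproduces the reported behavior (missing -> None).
    use_defaults=True pre-fills the prototype with the schema defaults.
    """
    proto = [merged[name] if use_defaults else None for name in merged]
    for i, name in enumerate(merged):
        if name in record:
            proto[i] = record[name]
    return tuple(proto)


merged = merge_fields(EMPLOYEE3, EMPLOYEE6)
milo = {"name": "Milo", "age": 30, "dept": "DH"}  # a record written with Employee3

print(read_record(milo, merged, use_defaults=False))
print(read_record(milo, merged, use_defaults=True))
```

In this sketch the only change between the two behaviors is how the prototype tuple is initialized, which is why mProtoTuple is the place to look.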

Thanks,
Cheolsoo

On Thu, May 2, 2013 at 8:34 PM, Viraj Bhat <[EMAIL PROTECTED]> wrote:

> Hi Cheolsoo/Pig User Group,
>   I am using the Pig 0.11 piggybank AvroStorage. When merging multiple
> schemas where default values have been specified in the Avro schema,
> AvroStorage puts nulls in the merged data set.
> Is this a known bug in the current implementation of AvroStorage?
> Here is an example provided by one of my colleagues. The final dataset should
> contain "LNU", 0, "OU" for all values where the columns do not exist.
> ==> Employee3.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0 },
>         {"name" : "dept", "type": "string", "default" : "DU"}
> ]
> }
>
> ==> Employee4.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "age", "type" : "int", "default" : 0},
>         {"name" : "dept", "type": "string", "default" : "DU"},
>         {"name" : "office", "type": "string", "default" : "OU"}
> ]
> }
>
> ==> Employee6.avro <==
> {
> "type" : "record",
> "name" : "employee",
> "fields":[
>         {"name" : "name", "type" : "string", "default" : "NU"},
>         {"name" : "lastname", "type": "string", "default" : "LNU"},
>         {"name" : "age", "type" : "int","default" : 0},
>         {"name" : "salary", "type": "int", "default" : 0},
>         {"name" : "dept", "type": "string","default" : "DU"},
>         {"name" : "office", "type": "string","default" : "OU"}
> ]
> }
>
> The pig script:
> employee = load '$input' using
> org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');
> describe employee;
> dump employee;
>
> The call:
> dump_employees.pig employee{3,4,6}.ser
>
> The output:
> employee: {name: chararray,age: int,dept: chararray,lastname:
> chararray,salary: int,office: chararray}
>
> (Milo,30,DH,,,)
> (Asmya,34,PQ,,,)
> (Baljit,23,RS,,,)
> (Pune,60,Astrophysics,Warriors,5466,UTA)
> (Rajsathan,20,Biochemistry,Royals,1378,Stanford)
> (Chennai,50,Microbiology,Superkings,7338,Hopkins)
> (Mumbai,20,Applied Math,Indians,4468,UAH)
> (Praj,54,RMX,,,Champaign)
> (Buba,767,HD,,,Sunnyvale)
> (Manku,375,MS,,,New York)
> Regards
> Viraj
>
> -----Original Message-----
> From: Cheolsoo Park [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, April 30, 2013 9:10 PM
> To: [EMAIL PROTECTED]
> Cc: Qi, Runping
> Subject: Re: Override input schema in AvroStorage
>
> Hi Steven,
>
> The new AvroStorage will let you specify the input schema:
> https://issues.apache.org/jira/browse/PIG-3015
>
> In fact, somebody made the same request in a comment of the jira that I am
> copying and pasting below:
>
> > Furthermore, we occasionally have issues with pig jobs picking the old
> > schema when we have a schema update. Manually specifying the schema
> > would fix this and give us more flexibility in defining the data we
> > want pig to pull from a file.
>
>
> This jira is a work in progress, but hopefully it will be in the next major
> release.
>
> Thanks,
> Cheolsoo
>
>
>
> On Sat, Apr 27, 2013 at 3:24 PM, Enns, Steven <[EMAIL PROTECTED]> wrote:
>
> > Resending now that I am subscribed :)
> >
> > On 4/25/13 4:01 PM, "Enns, Steven" <[EMAIL PROTECTED]> wrote:
> >
> > >Hi everyone,
> > >
> > >I would like to override the input schema in AvroStorage to make a