Pig >> mail # user >> Working with changing schemas (avro) in Pig


Re: Working with changing schemas (avro) in Pig
Yes, the schema can be in HDFS but the documentation for this is lacking.
Search for 'schema_file' here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java

and here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
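From those sources, usage is roughly along these lines (the paths here are made up, and the exact argument form should be checked against the AvroStorage source linked above):

```pig
-- Hypothetical sketch: keep the Avro schema in a file on HDFS rather
-- than inlining it in the Pig script, and point AvroStorage at it.
REGISTER piggybank.jar;
data = LOAD '/data/clickstream/2012-03-28'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage(
           'schema_file', 'hdfs:///schemas/clickstream.avsc');
```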

And be aware of this open JIRA:
https://issues.apache.org/jira/browse/PIG-2257

And this closed one:
https://issues.apache.org/jira/browse/PIG-2195

:)

thanks,
Bill

On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:

> The schema has to be written in the script right? I don't think there is
> any way the schema can be in a file outside the script. That was the
> messiness I was talking about. Or is there a way I can write the schema in
> a separate file? One way I see is to create and store a dummy file with the
> schema
>
>
> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> The default value will be part of the new Avro schema definition and Avro
>> should return it to you, so there shouldn't be any code messiness with that
>> approach.
>>
>>
>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>
>>> Ok.. you mean I can just use the newer schema to read the old schema as
>>> well, by populating some default value for the missing field. I think that
>>> should work, messy code though!
>>>
>>> Thanks!
>>>
>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>
>>>>  If you evolved your schema to just add fields, then you should be able
>>>> to
>>>> use a single schema descriptor file to read both pre- and post-evolved
>>>> data
>>>> objects. This is because one of the rules of new fields in Avro is that
>>>> they have to have a default value and be non-null. AvroStorage should
>>>> pick
>>>> that default field up for the old objects. If it doesn't, then that's a
>>>> bug.
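As a concrete sketch, an evolved schema of the kind described here could look like this (record and field names are hypothetical):

```json
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "ts",  "type": "long"},
    {"name": "referrer", "type": "string", "default": ""}
  ]
}
```

Reading an old record that lacks `referrer` with this schema yields the declared default (here, the empty string), per Avro's schema-resolution rules.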
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>>
>>>> > @Bill,
>>>> > I did look at the option of providing input as a parameter while
>>>> > initializing AvroStorage(). But even then, I'll still need to change
>>>> my
>>>> > script to handle the two files because I'll still need to have
>>>> separate
>>>> > schemas right?
>>>> >
>>>> > @Stan,
>>>> > Thanks for pointing me to it, it is a useful feature. But in my case,
>>>> I
>>>> > would never have two input files with different schemas. The input
>>>> will
>>>> > always have only one of the schemas, but I want my new script (with
>>>> the
>>>> > additional column) to be able to process the old data as well, even
>>>> if the
>>>> > input only contains data with the older schema.
>>>> >
>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>>>> >
>>>> > > There is a patch for Avro to deal with this use case:
>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>> > > (See the attached pig example which loads two avro input files with
>>>> > > different schemas.)
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > stan
>>>> > >
>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>> > > > Hi guys,
>>>> > > >
>>>> > > > I use Pig to process some clickstream data. I need to track a new
>>>> > field,
>>>> > > so
>>>> > > > I added a new field to my avro schema, and changed my Pig script
>>>> > > > accordingly. It works fine with the new files (which have that new
>>>> > > column)
>>>> > > > but it breaks when I run it on my old files which do not have that
>>>> > column
>>>> > > > in the schema (since avro stores schema in the data files
>>>> itself). I
>>>> > was
>>>> > > > expecting that Pig will assume the field to be null if that
>>>> particular
>>>> > > > field does not exist. But now I am having to maintain separate
>>>> scripts
>>>> > to
>>>> > > > process the old and new files. Is there any way to work around this?
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*
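Pulling the thread together: the suggested setup is one script and one externally stored reader schema (the new one, with defaults on the added fields) applied to both old and new files. A hypothetical sketch, with made-up paths:

```pig
-- Hypothetical sketch: load pre- and post-evolution Avro files with the
-- single v2 reader schema; old records pick up the declared defaults
-- for added fields via Avro schema resolution.
old_and_new = LOAD '/data/clickstream/*'
    USING org.apache.pig.piggybank.storage.avro.AvroStorage(
        'schema_file', 'hdfs:///schemas/clickstream_v2.avsc');
```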