Pig, mail # user - Working with changing schemas (avro) in Pig


Re: Working with changing schemas (avro) in Pig
Bill Graham 2012-03-29, 00:41
Yes, the schema can be in HDFS but the documentation for this is lacking.
Search for 'schema_file' here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java

and here:

http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
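
A minimal sketch of what that usage might look like, assuming the option is passed as a key/value pair to the AvroStorage constructor (the paths and alias here are made up; the source and tests linked above show the exact syntax):

```pig
-- Load Avro data using an external reader schema stored in HDFS,
-- rather than the writer schema embedded in each data file.
clicks = LOAD '/data/clickstream' USING
    org.apache.pig.piggybank.storage.avro.AvroStorage(
        'schema_file', 'hdfs:///schemas/clickstream.avsc');
```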

And be aware of this open JIRA:
https://issues.apache.org/jira/browse/PIG-2257

And this closed one:
https://issues.apache.org/jira/browse/PIG-2195

:)

thanks,
Bill

On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:

> The schema has to be written in the script, right? I don't think there is
> any way the schema can be in a file outside the script. That was the
> messiness I was talking about. Or is there a way I can write the schema in
> a separate file? One way I see is to create and store a dummy file with the
> schema.
>
>
> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> The default value will be part of the new Avro schema definition and Avro
>> should return it to you, so there shouldn't be any code messiness with that
>> approach.
>>
>>
>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>
>>> OK, you mean I can just use the newer schema to read the old data as
>>> well, by populating some default value for the missing field. I think that
>>> should work; messy code, though!
>>>
>>> Thanks!
>>>
>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>
>>>> If you evolved your schema to just add fields, then you should be able to
>>>> use a single schema descriptor file to read both pre- and post-evolved
>>>> data objects. This is because one of the rules of new fields in Avro is
>>>> that they have to have a default value and be non-null. AvroStorage should
>>>> pick that default field up for the old objects. If it doesn't, then that's
>>>> a bug.
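
As a hedged illustration of the rule described above, an evolved Avro schema might look like this, where the added field carries a non-null default that readers supply when decoding pre-evolution records (the record and field names here are invented):

```json
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "url", "type": "string"},
    {"name": "ts",  "type": "long"},
    {"name": "new_field", "type": "string", "default": ""}
  ]
}
```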
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>>
>>>> > @Bill,
>>>> > I did look at the option of providing input as a parameter while
>>>> > initializing AvroStorage(). But even then, I'll still need to change my
>>>> > script to handle the two files because I'll still need to have separate
>>>> > schemas, right?
>>>> >
>>>> > @Stan,
>>>> > Thanks for pointing me to it, it is a useful feature. But in my case, I
>>>> > would never have two input files with different schemas. The input will
>>>> > always have only one of the schemas, but I want my new script (with the
>>>> > additional column) to be able to process the old data as well, even if
>>>> > the input only contains data with the older schema.
>>>> >
>>>> > On Wed, Mar 28, 2012 at 3:00 PM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
>>>> >
>>>> > > There is a patch for Avro to deal with this use case:
>>>> > > https://issues.apache.org/jira/browse/PIG-2579
>>>> > > (See the attached pig example which loads two avro input files with
>>>> > > different schemas.)
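
Independent of that patch, one plain-Pig sketch for combining relations whose schemas differ only by added fields is UNION ONSCHEMA, which merges by field name (paths and aliases here are hypothetical):

```pig
-- Each relation is loaded with the writer schema embedded in its files.
old_clicks = LOAD '/data/clicks_v1' USING
    org.apache.pig.piggybank.storage.avro.AvroStorage();
new_clicks = LOAD '/data/clicks_v2' USING
    org.apache.pig.piggybank.storage.avro.AvroStorage();
-- UNION ONSCHEMA merges by field name; fields missing from one side
-- come through as null.
all_clicks = UNION ONSCHEMA old_clicks, new_clicks;
```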
>>>> > >
>>>> > > Best,
>>>> > >
>>>> > > stan
>>>> > >
>>>> > > On Wed, Mar 28, 2012 at 4:22 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>> > > > Hi guys,
>>>> > > >
>>>> > > > I use Pig to process some clickstream data. I need to track a new
>>>> > > > field, so I added a new field to my avro schema and changed my Pig
>>>> > > > script accordingly. It works fine with the new files (which have that
>>>> > > > new column), but it breaks when I run it on my old files, which do
>>>> > > > not have that column in the schema (since avro stores the schema in
>>>> > > > the data files themselves). I was expecting that Pig would assume the
>>>> > > > field to be null if that particular field does not exist. But now I
>>>> > > > am having to maintain separate scripts to process the old and new
>>>> > > > files. Is there any workaround for this?
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*