Pig, mail # user - Working with changing schemas (avro) in Pig


Re: Working with changing schemas (avro) in Pig
Bill Graham 2012-03-30, 04:59
TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile has an
example:

STORE avro2 INTO 'output_dir'
USING org.apache.pig.piggybank.storage.avro.AvroStorage (
'{"schema_file": "/path/to/schema/file",
  "field0": "def:member_id",
  "field1": "def:browser_id",
  "field3": "def:act_content"}'
);

You specify the file that contains the schema, then you have to map each
tuple field to the name of the corresponding field in the Avro schema. This
mapping is a drag, but it's currently required.
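
Since the mapping is tedious to write by hand, one option is to generate the
constructor argument from a list of field names. The following is only a
sketch: the helper name avro_storage_args is hypothetical, and it assumes that
"field<N>" refers to the zero-based position of the field in the tuple, which
is inferred from the example above rather than confirmed. A position set to
None is left unmapped, as field2 is in the example.

```python
import json


def avro_storage_args(schema_file, field_names):
    """Build the JSON string passed to AvroStorage's json-style constructor.

    field_names lists Avro field names by tuple position (assumed zero-based);
    use None to leave a position unmapped.
    """
    params = {"schema_file": schema_file}
    for index, name in enumerate(field_names):
        if name is not None:
            # "def:<name>" points the tuple field at the named schema field.
            params["field%d" % index] = "def:" + name
    return json.dumps(params)


# Same path and field names as the STORE statement above.
args = avro_storage_args(
    "/path/to/schema/file",
    ["member_id", "browser_id", None, "act_content"],
)

# Emit a STORE statement that can be pasted into a Pig script.
print("STORE avro2 INTO 'output_dir'")
print("USING org.apache.pig.piggybank.storage.avro.AvroStorage('%s');" % args)
```

The printed statement carries the same schema_file and field mappings as the
hand-written one, so the script only needs to change when the field list does.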

Note that only the JSON-style constructor (as opposed to the string array
approach) supports schema_file without this uncommitted patch:
https://issues.apache.org/jira/browse/PIG-2257
thanks,
Bill

On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:

> That's nice! Can you give me an example of how to use it? I am not able to
> figure it out from the code. The schemaManager is only used at one place
> after that, and that is when the params contains a "field<number>" key. I
> don't understand that part. Is there a way I can call it simply like STORE
> xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
>
>
>
> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
>> Yes, the schema can be in HDFS but the documentation for this is lacking.
>> Search for 'schema_file' here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
>>
>> and here:
>>
>>
>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
>>
>> And be aware of this open JIRA:
>> https://issues.apache.org/jira/browse/PIG-2257
>>
>> And this closed one:
>> https://issues.apache.org/jira/browse/PIG-2195
>>
>> :)
>>
>> thanks,
>> Bill
>>
>>
>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>
>>> The schema has to be written in the script right? I don't think there is
>>> any way the schema can be in a file outside the script. That was the
>>> messiness I was talking about. Or is there a way I can write the schema in
>>> a separate file? One way I see is to create and store a dummy file with the
>>> schema
>>>
>>>
>>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>
>>>> The default value will be part of the new Avro schema definition and
>>>> Avro should return it to you, so there shouldn't be any code messiness with
>>>> that approach.
>>>>
>>>>
>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Ok.. you mean I can just use the newer schema to read the old schema
>>>>> as well, by populating some default value for the missing field. I think
>>>>> that should work, messy code though!
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> If you evolved your schema to just add fields, then you should be able to
>>>>>> use a single schema descriptor file to read both pre- and post-evolved data
>>>>>> objects. This is because one of the rules of new fields in Avro is that
>>>>>> they have to have a default value and be non-null. AvroStorage should pick
>>>>>> that default field up for the old objects. If it doesn't, then that's a bug.
>>>>>>
>>>>>>
>>>>>> On Wed, Mar 28, 2012 at 3:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>> > @Bill,
>>>>>> > I did look at the option of providing input as a parameter while
>>>>>> > initializing AvroStorage(). But even then, I'll still need to change my
>>>>>> > script to handle the two files because I'll still need to have separate
>>>>>> > schemas right?
>>>>>> >
>>>>>> > @Stan,
>>>>>> > Thanks for pointing me to it, it is a useful feature. But in my case, I
>>>>>> > would never have two input files with different schemas. The input will
>>>>>> > always have only one of the schemas, but I want my new script (with
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*
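
The point made earlier in the thread, that one evolved schema can read both old
and new records because added fields carry a default, can be illustrated with a
toy model. This is a sketch in plain Python, not the Avro library or
AvroStorage itself: the record name "Event", the added field "referrer", its
types, and its default value are all hypothetical, and the resolve function
only mimics the rule that a field missing from the written data takes the
reader schema's default.

```python
# Evolved (reader) schema: the original fields plus one added field.
NEW_SCHEMA = {
    "type": "record",
    "name": "Event",  # hypothetical record name
    "fields": [
        {"name": "member_id", "type": "int"},
        {"name": "browser_id", "type": "string"},
        # Hypothetical field added after the old data was written.
        {"name": "referrer", "type": "string", "default": "unknown"},
    ],
}


def resolve(record, reader_schema):
    """Return the record as seen through reader_schema, filling defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field absent from the old data: fall back to the schema default.
            resolved[name] = field["default"]
        else:
            raise ValueError("no value or default for field %r" % name)
    return resolved


old_record = {"member_id": 1212, "browser_id": "abc"}
new_record = {"member_id": 1213, "browser_id": "def", "referrer": "mail"}

for record in (old_record, new_record):
    print(resolve(record, NEW_SCHEMA))
```

In this model the old record gains the default for the added field while the
new record passes through unchanged, so downstream code sees one shape for
both. A field without a default raises an error, which is why added fields
need one.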