Pig >> mail # user >> Working with changing schemas (avro) in Pig


Re: Working with changing schemas (avro) in Pig
Elephant Bird has functionality to integrate with Protobufs and Thrift but
not Avro. When reading and writing messages of either type, EB expects
classes to be generated from schema definitions at build time. It doesn't
read schema definitions at run time to dynamically generate messages as one
would with Avro. Hence EB takes a different approach and doesn't have to
deal with an evolving schema file the way AvroStorage does.
On Sun, Apr 1, 2012 at 9:32 AM, Alex Rovner <[EMAIL PROTECTED]> wrote:

> Anyone have any experience with elephantbird? Seems like it can handle
> these cases with ease?
>
> Sent from my iPhone
>
> On Mar 30, 2012, at 12:59 AM, Bill Graham <[EMAIL PROTECTED]> wrote:
>
> > In the TestAvroStorage.testRecordWithFieldSchemaFromTextWithSchemaFile
> > there's an example:
> >
> > STORE avro2 INTO 'output_dir'
> > USING org.apache.pig.piggybank.storage.avro.AvroStorage (
> > '{"schema_file": "/path/to/schema/file" ,
> > "field0": "def:member_id",
> > "field1": "def:browser_id",
> > "field3": "def:act_content" }'
> > );
> >
> > You specify the file that contains the schema, then you have to map the
> > tuple fields to the names of the fields in the avro schema. This mapping
> > is a drag, but it's currently required.
> >
> > Note that only the json-style constructor (as opposed to the string-array
> > approach) supports schema_file without this uncommitted patch:
> > https://issues.apache.org/jira/browse/PIG-2257
> >
> >
> > thanks,
> > Bill
> >
> > On Thu, Mar 29, 2012 at 1:05 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
> >
> >> That's nice! Can you give me an example of how to use it? I am not able
> >> to figure it out from the code. The schemaManager is only used at one
> >> place after that, and that is when the params contain a "field<number>"
> >> key. I don't understand that part. Is there a way I can call it simply,
> >> like STORE xyz INTO 'abc' USING AvroStorage('schema_file=/path/to/schema/file')?
> >>
> >>
> >>
> >> On Wed, Mar 28, 2012 at 5:41 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> >>
> >>> Yes, the schema can be in HDFS but the documentation for this is lacking.
> >>> Search for 'schema_file' here:
> >>>
> >>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/main/java/org/apache/pig/piggybank/storage/avro/AvroStorage.java
> >>>
> >>> and here:
> >>>
> >>> http://svn.apache.org/repos/asf/pig/trunk/contrib/piggybank/java/src/test/java/org/apache/pig/piggybank/test/storage/avro/TestAvroStorage.java
> >>>
> >>> And be aware of this open JIRA:
> >>> https://issues.apache.org/jira/browse/PIG-2257
> >>>
> >>> And this closed one:
> >>> https://issues.apache.org/jira/browse/PIG-2195
> >>>
> >>> :)
> >>>
> >>> thanks,
> >>> Bill
> >>>
> >>>
> >>> On Wed, Mar 28, 2012 at 5:26 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
> >>>
> >>>> The schema has to be written in the script, right? I don't think there
> >>>> is any way the schema can be in a file outside the script. That was the
> >>>> messiness I was talking about. Or is there a way I can write the schema
> >>>> in a separate file? One way I see is to create and store a dummy file
> >>>> with the schema.
> >>>>
> >>>>
> >>>> On Wed, Mar 28, 2012 at 4:19 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> >>>>
> >>>>> The default value will be part of the new Avro schema definition and
> >>>>> Avro should return it to you, so there shouldn't be any code messiness
> >>>>> with that approach.
> >>>>>
> >>>>>
> >>>>> On Wed, Mar 28, 2012 at 4:01 PM, IGZ Nick <[EMAIL PROTECTED]> wrote:
> >>>>>
> >>>>>> Ok, you mean I can just use the newer schema to read data written
> >>>>>> with the old schema as well, by populating a default value for the
> >>>>>> missing field. I think that should work, messy code though!
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>> On Wed, Mar 28, 2012 at 3:53 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> >>>>>>
> >>>>>>> If you evolved your schema to just add fields, then you should be
> >>>>>>> able to read your old data with the new schema, with Avro filling
> >>>>>>> in the declared default values for the missing fields.

*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*
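
The schema-evolution approach discussed in this thread (add new fields with defaults so the new schema can still read records written with the old one) can be sketched in plain Python. This is only an illustration of Avro's resolution rule, not the Avro library itself, and the record and field names are hypothetical, borrowed from the example earlier in the thread:

```python
import json

# New (reader) schema: "browser_id" was added with a default, so records
# written with the old schema, which lacked it, can still be read.
NEW_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Activity",
  "fields": [
    {"name": "member_id", "type": "long"},
    {"name": "browser_id", "type": ["null", "string"], "default": null}
  ]
}
""")

def resolve(record, reader_schema):
    """Fill in reader-schema defaults for fields missing from a record."""
    out = dict(record)
    for field in reader_schema["fields"]:
        if field["name"] not in out:
            if "default" not in field:
                raise ValueError("no default for missing field: " + field["name"])
            out[field["name"]] = field["default"]
    return out

# A record written with the old schema (no browser_id):
old_record = {"member_id": 42}
print(resolve(old_record, NEW_SCHEMA))  # {'member_id': 42, 'browser_id': None}
```

This is why the "messy code" concern above largely disappears: the reader never sees a missing field, only the declared default.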