Re: Globbing several AVRO files with different (extended) schemas
There is a patch for AvroStorage which computes a union schema, thereby
allowing the input Avro files to have different schemas, specifically
(un-nested) records with different fields.

https://issues.apache.org/jira/browse/PIG-2579
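
For illustration, a minimal Pig Latin sketch of how the patched loader might be
used (the 'multiple_schemas' option name and the input path are assumptions of
mine; check the patch itself for the exact syntax):

REGISTER piggybank.jar;  -- assumes the patched AvroStorage is built into piggybank

-- Glob all Avro files under one location; the patched loader is assumed to
-- compute a union schema across the files the glob matches.
events = LOAD '/data/events/*.avro'
         USING org.apache.pig.piggybank.storage.avro.AvroStorage('multiple_schemas');

DESCRIBE events;  -- should show the merged (union) schema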

Best,

stan

On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> A question about this: does Avro have clear-cut rules for how to
> merge two arbitrary JSON schemas?
>
> 2012/3/21 Jonathan Coveney <[EMAIL PROTECTED]>
>
>> ATM, there is no quick and easy solution short of patching Pig... feel
>> free to make a ticket.
>>
>> Short of that, what you can do is load each file with its own schema
>> into a separate relation, and then union the relations. Given that there
>> might be a lot of different relations and schemas involved, you could
>> probably write a script to generate this for you... but yeah, the
>> long-term approach is to patch AvroStorage.
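
A minimal Pig Latin sketch of that manual workaround (the paths and relation
names are hypothetical; UNION ONSCHEMA matches fields by name and fills fields
that are missing from one input with nulls):

day1 = LOAD '/data/2012-03-20.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();
day2 = LOAD '/data/2012-03-21.avro'
       USING org.apache.pig.piggybank.storage.avro.AvroStorage();

-- UNION ONSCHEMA unions by field name, so a field present only in day2's
-- schema comes out as null for the day1 records.
merged = UNION ONSCHEMA day1, day2;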
>>
>>
>> 2012/3/21 Markus Resch <[EMAIL PROTECTED]>
>>
>>> Hi guys,
>>>
>>> Thanks again for your awesome hint about Sqoop.
>>>
>>> I have another question: the data I'm working with is stored as Avro
>>> files in Hadoop. When I glob them, everything works just fine. But when
>>> I add something to the schema of a single data file (e.g. a new field)
>>> while the others remain unchanged, everything breaks:
>>>
>>> "currently we assume all avro files under the same "location"
>>>     * share the same schema and will throw exception if not."
>>>
>>> The behavior I would expect when globbing several files with slightly
>>> different schemas is that the LOAD either returns only the fields common
>>> to all schemas, or returns all fields, with the missing ones set to null.
>>>
>>> How could I handle this properly?
>>>
>>> Thanks
>>>
>>> Markus
>>>
>>>
>>>
>>