Pig >> mail # user >> Globbing several AVRO files with different (extended) schemes

Re: Globbing several AVRO files with different (extended) schemes
There is a patch for AvroStorage that computes a union schema, thereby
allowing input Avro files with different
schemas, specifically (un-nested) records with different fields.




On Wed, Mar 21, 2012 at 8:31 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
> A question about this: does Avro have clear cut rules for how to
> essentially merge two arbitrary JSON schemas?
> 2012/3/21 Jonathan Coveney <[EMAIL PROTECTED]>
>> ATM, there is no quick and easy solution short of patching Pig... feel
>> free to make a ticket.
>> Short of that, what you can do is load each relation with a different
>> schema separately, and then union them. Given that there might be a
>> lot of different relations and schemas involved, you could probably make a
>> script to do this for you... but yeah, the long term approach is to patch
>> AvroStorage.
>> 2012/3/21 Markus Resch <[EMAIL PROTECTED]>
>>> Hi guys,
>>> Thanks again for your awesome hint about sqoop.
>>> I have another question: The data I'm working with is stored as Avro
>>> files in Hadoop. When I glob them, everything works
>>> perfectly. But when I add something to the schema of a single data file
>>> while the others remain unchanged, everything breaks:
>>> "currently we assume all avro files under the same "location"
>>>     * share the same schema and will throw exception if not."
>>> (e.g. I add a new data field) The behavior I would expect: if I'm
>>> globbing several files with slightly different schemas, the LOAD
>>> would either return the intersection of the fields that are
>>> common to all schemas, or null the atoms of the missing fields.
>>> How could I handle this properly?
>>> Thanks
>>> Markus
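
The workaround suggested in the thread (load each schema variant separately, then union the relations, with a script doing the repetitive part) could look roughly like the sketch below. It is a Python helper that only generates Pig Latin text. It assumes the piggybank AvroStorage loader and Pig's UNION ONSCHEMA, which merges relations by field name, and it leaves out the REGISTER statements a real script would need. The function name, the aliases, and the input paths are hypothetical.

```python
def build_pig_script(
    paths,
    storage="org.apache.pig.piggybank.storage.avro.AvroStorage()",
):
    """Emit one LOAD per path, then a single UNION ONSCHEMA over all of them.

    Assumption: all files under one path share a schema, so each LOAD
    succeeds on its own and only the union has to reconcile the fields.
    """
    if not paths:
        raise ValueError("at least one input path is required")

    lines = []
    aliases = []
    for i, path in enumerate(paths):
        alias = "part_%d" % i
        aliases.append(alias)
        lines.append("%s = LOAD '%s' USING %s;" % (alias, path, storage))

    if len(aliases) == 1:
        # Nothing to union; just expose the single relation as "merged".
        lines.append("merged = FOREACH %s GENERATE *;" % aliases[0])
    else:
        lines.append("merged = UNION ONSCHEMA %s;" % ", ".join(aliases))
    return "\n".join(lines)


# Hypothetical locations, one per schema version.
script = build_pig_script(["/data/events/v1", "/data/events/v2"])
print(script)
```

The generated text can be saved to a file and run with Pig after adding the REGISTER lines for the jars that AvroStorage depends on in the local installation.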