Re: Joining Avro input files using Java mapreduce
Sripad,

have you considered simply using a union of the two schemas as the input
schema?

Schema.createUnion(Lists.newArrayList(schema1,schema2));

In the mapper you have to check the record type / schema name /
SpecificRecord instance to extract your join key, but otherwise it's really
straightforward.
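
A rough sketch of such a mapper, using the generic API (the "Clicks" schema
name and the userId/memberId fields are just placeholders for whatever your
two schemas actually contain):

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.hadoop.mapred.Reporter;

// Reduce-side join mapper over a union input schema, e.g. one set up with
// AvroJob.setInputSchema(conf, Schema.createUnion(Lists.newArrayList(schema1, schema2))).
public class UnionJoinMapper
    extends AvroMapper<GenericRecord, Pair<CharSequence, GenericRecord>> {

  @Override
  public void map(GenericRecord record,
                  AvroCollector<Pair<CharSequence, GenericRecord>> collector,
                  Reporter reporter) throws IOException {
    // Branch on the schema name to pick the join key out of either record type.
    Object key = "Clicks".equals(record.getSchema().getName())
        ? record.get("userId")
        : record.get("memberId");

    collector.collect(new Pair<CharSequence, GenericRecord>(
        key.toString(), Schema.create(Schema.Type.STRING),
        record, record.getSchema()));
  }
}

The map output schema for the job is then a Pair of string and the same
union, so the reducer sees records of both types grouped under one key.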

Johannes
On Thu, Apr 25, 2013 at 6:05 PM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:

> I'm afraid I don't have an example -- the code I have is very entangled
> with our internal stuff; it would take a while to extract the
> general-purpose parts.
>
> I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for
> mappers, since those are the types produced by AvroInputFormat:
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup
>
> The reducer input types are just your mapper output types, so you can
> choose those yourself (any Hadoop writables).
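>
> As a type-level illustration only (not our actual code -- UserMapper, the
> Text output types and the "memberId" join field below are made up):
>
> import java.io.IOException;
>
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.mapred.AvroWrapper;
> import org.apache.hadoop.io.NullWritable;
> import org.apache.hadoop.io.Text;
> import org.apache.hadoop.mapred.MapReduceBase;
> import org.apache.hadoop.mapred.Mapper;
> import org.apache.hadoop.mapred.OutputCollector;
> import org.apache.hadoop.mapred.Reporter;
>
> // Plain Hadoop (old "mapred" API) mapper reading Avro records via AvroInputFormat.
> public class UserMapper extends MapReduceBase
>     implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, Text> {
>
>   public void map(AvroWrapper<GenericRecord> key, NullWritable value,
>                   OutputCollector<Text, Text> output, Reporter reporter)
>       throws IOException {
>     GenericRecord record = key.datum();
>     // Emit the join key plus a serialized form of the record (toString() is JSON).
>     output.collect(new Text(record.get("memberId").toString()),
>                    new Text(record.toString()));
>   }
> }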
>
> Martin
>
>
> On 25 April 2013 08:26, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>
>> Thanks, Martin! Would you happen to have a gist of an example? Did you
>> mean the reducer input is NullWritable?
>>
>> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[EMAIL PROTECTED]>
>> wrote:
>>
>> Oh, sorry, you're right. I was too hasty.
>>
>> One approach that I've used for joining Avro inputs is to use regular
>> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
>> MultipleInputs and AvroInputFormat. Your mapper input key type is then
>> AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
>> This approach uses Hadoop sequence files (rather than Avro files) between
>> mappers and reducers, so you have to take care of serializing mapper output
>> and deserializing reducer input yourself. It works, but you have to write
>> quite a bit of annoying boilerplate code.
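>>
>> The job setup is along these lines (the paths and the UserMapper /
>> EventMapper / JoinReducer classes are just placeholders for your own):
>>
>> import org.apache.avro.mapred.AvroInputFormat;
>> import org.apache.hadoop.fs.Path;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.FileOutputFormat;
>> import org.apache.hadoop.mapred.JobClient;
>> import org.apache.hadoop.mapred.JobConf;
>> import org.apache.hadoop.mapred.lib.MultipleInputs;
>>
>> public class AvroJoinDriver {
>>   public static void main(String[] args) throws Exception {
>>     JobConf conf = new JobConf(AvroJoinDriver.class);
>>     conf.setJobName("avro-reduce-side-join");
>>
>>     // One mapper per Avro input directory, both reading through AvroInputFormat.
>>     MultipleInputs.addInputPath(conf, new Path(args[0]), AvroInputFormat.class, UserMapper.class);
>>     MultipleInputs.addInputPath(conf, new Path(args[1]), AvroInputFormat.class, EventMapper.class);
>>
>>     // Intermediate and final types are ordinary Hadoop writables; the mappers
>>     // have to turn the Avro records into something the reducer can decode itself.
>>     conf.setMapOutputKeyClass(Text.class);
>>     conf.setMapOutputValueClass(Text.class);
>>     conf.setReducerClass(JoinReducer.class);
>>     conf.setOutputKeyClass(Text.class);
>>     conf.setOutputValueClass(Text.class);
>>     FileOutputFormat.setOutputPath(conf, new Path(args[2]));
>>
>>     JobClient.runJob(conf);
>>   }
>> }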
>>
>> I'd also be interested if anyone has a better solution. Perhaps we just
>> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>>
>> Martin
>>
>>
>> On 24 April 2013 12:02, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>
>>> Hey Martin,
>>>
>>> I think those classes refer to outputting to multiple files rather than
>>> reading from multiple files, which is what's needed for a reduce-side join.
>>>
>>> thanks,
>>> Sripad
>>>
>>>
>>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey Sripad,
>>>>
>>>> Take a look at AvroMultipleInputs.
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>>
>>>> Martin
>>>>
>>>>
>>>> On 23 April 2013 17:01, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I'm aware that I can use Pig, Hive, etc. to join Avro files together, but I
>>>>> have several use cases where I need to perform a reduce-side join on two
>>>>> Avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>>>> thoughts?
>>>>>
>>>>> thanks!
>>>>> Sripad
>>>>>
>>>>
>>>>
>>>
>>
>