Avro, mail # user - Joining Avro input files in using Java mapreduce


Re: Joining Avro input files in using Java mapreduce
Martin Kleppmann 2013-04-25, 16:05
I'm afraid I don't have an example -- the code I have is very entangled
with our internal stuff; it would take a while to extract the
general-purpose parts.

I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for mappers,
since those are the types produced by AvroInputFormat:
http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup

The reducer input types are just your mapper output types, so you can
choose those yourself (any Hadoop writables).
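
For illustration only (this is not the internal code mentioned above): a minimal, dependency-free Java sketch of the tagging logic a reduce-side join usually needs once the mapper output types are chosen freely. It deliberately uses no Hadoop or Avro classes. The class name, the 'L'/'R' source tags, the tab-separated string encoding, and the shuffle() helper that imitates grouping by key are all assumptions made for the example. In a real job the map step would read from AvroWrapper<GenericRecord> keys and emit Hadoop writables instead of strings.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for a mapper/reducer pair; plain Java only.
class ReduceSideJoinSketch {
    // Assumed tags marking which input a record came from.
    static final char LEFT = 'L';
    static final char RIGHT = 'R';

    // "Map" step: emit (joinKey, tagged payload). The tag is what lets the
    // reducer tell the two inputs apart after the shuffle has mixed them.
    static Map.Entry<String, String> tag(String joinKey, char source, String payload) {
        return new AbstractMap.SimpleEntry<>(joinKey, source + "\t" + payload);
    }

    // Imitates the shuffle: groups all tagged values by join key.
    static Map<String, List<String>> shuffle(List<Map.Entry<String, String>> mapOutput) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> e : mapOutput) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }
        return grouped;
    }

    // "Reduce" step: split the values for one key by tag, then pair every
    // left record with every right record (an inner join).
    static List<String> join(Iterable<String> taggedValues) {
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String value : taggedValues) {
            char source = value.charAt(0);
            String payload = value.substring(2); // skip the tag and the tab
            if (source == LEFT) {
                left.add(payload);
            } else if (source == RIGHT) {
                right.add(payload);
            } else {
                throw new IllegalArgumentException("Unknown source tag: " + source);
            }
        }
        List<String> joined = new ArrayList<>();
        for (String l : left) {
            for (String r : right) {
                joined.add(l + "," + r);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> mapOutput = Arrays.asList(
            tag("u1", LEFT, "alice"),
            tag("u1", RIGHT, "order-1"),
            tag("u2", LEFT, "bob"),
            tag("u1", RIGHT, "order-2"));
        for (Map.Entry<String, List<String>> e : shuffle(mapOutput).entrySet()) {
            System.out.println(e.getKey() + " -> " + join(e.getValue()));
        }
    }
}
```

Keys that appear in only one input produce no output here; an outer join would need to emit those unmatched records as well.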

Martin
On 25 April 2013 08:26, Sripad Sriram <[EMAIL PROTECTED]> wrote:

> Thanks! Martin, would you happen to have a gist of an example? Did you
> mean the reducer input is NullWritable?
>
> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[EMAIL PROTECTED]>
> wrote:
>
> Oh, sorry, you're right. I was too hasty.
>
> One approach that I've used for joining Avro inputs is to use regular
> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
> MultipleInputs and AvroInputFormat. Your mapper input key type is then
> AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
> This approach uses Hadoop sequence files (rather than Avro files) between
> mappers and reducers, so you have to take care of serializing mapper output
> and deserializing reducer input yourself. It works, but you have to write
> quite a bit of annoying boilerplate code.
>
> I'd also be interested if anyone has a better solution. Perhaps we just
> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>
> Martin
>
>
> On 24 April 2013 12:02, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>
>> Hey Martin,
>>
>> I think those classes refer to outputting to multiple files rather than
>> reading from multiple files, which is what's needed for a reduce-side join.
>>
>> thanks,
>> Sripad
>>
>>
>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>>
>>> Hey Sripad,
>>>
>>> Take a look at AvroMultipleInputs.
>>>
>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>
>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>
>>> Martin
>>>
>>>
>>> On 23 April 2013 17:01, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey folks,
>>>>
>>>> Aware that I can use Pig, Hive, etc to join avro files together, but I
>>>> have several use cases where I need to perform a reduce-side join on two
>>>> avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>>> thoughts?
>>>>
>>>> thanks!
>>>> Sripad
>>>>
>>>
>>>
>>
>