Avro, mail # user - Joining Avro input files using Java mapreduce


Re: Joining Avro input files using Java mapreduce
Sripad Sriram 2013-04-28, 02:11
That makes a lot of sense - thanks, and I'll give it a shot!
On Fri, Apr 26, 2013 at 1:41 PM, Johannes Schulte <[EMAIL PROTECTED]> wrote:

> Sripad,
>
> have you considered simply using a union of the two schemas as the input
> schema?
>
> Schema.createUnion(Lists.newArrayList(schema1,schema2));
>
> In the mapper you have to check the record type / schema name /
> SpecificRecord instance to extract your join key, but otherwise it's really
> straightforward.
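>
> Roughly something like this (untested sketch; schema1/schema2 and the
> record and field names are placeholders for whatever yours are):
>
> import org.apache.avro.Schema;
> import org.apache.avro.generic.GenericRecord;
> import org.apache.avro.mapred.AvroJob;
> import com.google.common.collect.Lists;
>
> Schema union = Schema.createUnion(Lists.newArrayList(schema1, schema2));
> AvroJob.setInputSchema(conf, union);  // conf is your JobConf
>
> // In the mapper, branch on the schema name to pick the join key.
> CharSequence joinKey;
> if ("User".equals(record.getSchema().getName())) {
>   joinKey = (CharSequence) record.get("userId");   // made-up field name
> } else {
>   joinKey = (CharSequence) record.get("ownerId");  // made-up field name
> }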
>
> Johannes
>
>
>> On Thu, Apr 25, 2013 at 6:05 PM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>
>> I'm afraid I don't have an example -- the code I have is very entangled
>> with our internal stuff; it would take a while to extract the
>> general-purpose parts.
>>
>> I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for
>> mappers, since those are the types produced by AvroInputFormat:
>> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup
>>
>> The reducer input types are just your mapper output types, so you can
>> choose those yourself (any Hadoop writables).
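>>
>> To make the types concrete, a skeleton mapper would look something like
>> this (untested, not from our real code; the "id" field is a placeholder):
>>
>> import java.io.IOException;
>> import org.apache.avro.generic.GenericRecord;
>> import org.apache.avro.mapred.AvroWrapper;
>> import org.apache.hadoop.io.NullWritable;
>> import org.apache.hadoop.io.Text;
>> import org.apache.hadoop.mapred.MapReduceBase;
>> import org.apache.hadoop.mapred.Mapper;
>> import org.apache.hadoop.mapred.OutputCollector;
>> import org.apache.hadoop.mapred.Reporter;
>>
>> public class AvroJoinMapper extends MapReduceBase
>>     implements Mapper<AvroWrapper<GenericRecord>, NullWritable, Text, Text> {
>>   public void map(AvroWrapper<GenericRecord> key, NullWritable value,
>>       OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
>>     GenericRecord record = key.datum();
>>     // Emit the join key; serializing the whole record into the value
>>     // (e.g. as Avro binary in a BytesWritable) is up to you.
>>     out.collect(new Text(record.get("id").toString()),
>>         new Text(record.toString()));
>>   }
>> }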
>>
>> Martin
>>
>>
>> On 25 April 2013 08:26, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>
>>> Thanks! Martin, would you happen to have a gist of an example? Did you
>>> mean the reducer input is NullWritable?
>>>
>>> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>> Oh, sorry, you're right. I was too hasty.
>>>
>>> One approach that I've used for joining Avro inputs is to use regular
>>> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
>>> MultipleInputs and AvroInputFormat. Your mapper input key type is then
>>> AvroWrapper<GenericRecord>, and mapper input value type is NullWritable.
>>> This approach uses Hadoop sequence files (rather than Avro files) between
>>> mappers and reducers, so you have to take care of serializing mapper output
>>> and deserializing reducer input yourself. It works, but you have to write
>>> quite a bit of annoying boilerplate code.
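>>>
>>> The job wiring is roughly this (from memory, untested; the paths and
>>> class names are placeholders):
>>>
>>> import org.apache.avro.mapred.AvroInputFormat;
>>> import org.apache.hadoop.fs.Path;
>>> import org.apache.hadoop.io.Text;
>>> import org.apache.hadoop.mapred.JobConf;
>>> import org.apache.hadoop.mapred.lib.MultipleInputs;
>>>
>>> JobConf conf = new JobConf(JoinJob.class);
>>>
>>> // One mapper per input; both inputs are read with AvroInputFormat.
>>> MultipleInputs.addInputPath(conf, new Path("/data/users"),
>>>     AvroInputFormat.class, UserJoinMapper.class);
>>> MultipleInputs.addInputPath(conf, new Path("/data/events"),
>>>     AvroInputFormat.class, EventJoinMapper.class);
>>>
>>> // Plain Hadoop writables between map and reduce; the Avro records
>>> // are hand-serialized into the values.
>>> conf.setMapOutputKeyClass(Text.class);
>>> conf.setMapOutputValueClass(Text.class);
>>> conf.setReducerClass(JoinReducer.class);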
>>>
>>> I'd also be interested if anyone has a better solution. Perhaps we just
>>> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>>>
>>> Martin
>>>
>>>
>>> On 24 April 2013 12:02, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey Martin,
>>>>
>>>> I think those classes refer to outputting to multiple files rather than
>>>> reading from multiple files, which is what's needed for a reduce-side join.
>>>>
>>>> thanks,
>>>> Sripad
>>>>
>>>>
>>>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <
>>>> [EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hey Sripad,
>>>>>
>>>>> Take a look at AvroMultipleInputs.
>>>>>
>>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>>>
>>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>> On 23 April 2013 17:01, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hey folks,
>>>>>>
>>>>>> I'm aware that I can use Pig, Hive, etc. to join Avro files together, but
>>>>>> I have several use cases where I need to perform a reduce-side join on two
>>>>>> Avro files. MultipleInputs doesn't seem to like AvroInputFormat - any
>>>>>> thoughts?
>>>>>>
>>>>>> thanks!
>>>>>> Sripad
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>