Re: Joining Avro input files using Java MapReduce
Sripad,

Have you considered simply using a union of the two schemas as the input
schema?

Schema.createUnion(Lists.newArrayList(schema1, schema2));

In the mapper you have to check the record type (via the schema name or the
SpecificRecord instance) to extract your join key, but otherwise it's really
straightforward.

Johannes
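
A minimal sketch of this union approach, using the old mapred API; the
schema names ("User", "Event"), the join fields ("id", "user_id"), and the
surrounding class are all hypothetical:

import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroCollector;
import org.apache.avro.mapred.AvroJob;
import org.apache.avro.mapred.AvroMapper;
import org.apache.avro.mapred.Pair;
import org.apache.avro.util.Utf8;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

import com.google.common.collect.Lists;

public class UnionJoin {

  // Both inputs decode against the union, so a single mapper sees records
  // of either schema and tags each one with its join key.
  public static class JoinMapper
      extends AvroMapper<GenericRecord, Pair<Utf8, GenericRecord>> {
    @Override
    public void map(GenericRecord record,
                    AvroCollector<Pair<Utf8, GenericRecord>> collector,
                    Reporter reporter) throws IOException {
      // Branch on the schema name to locate the join key in each record
      // type (schema and field names are hypothetical).
      boolean isUser = "User".equals(record.getSchema().getName());
      Utf8 key = (Utf8) (isUser ? record.get("id") : record.get("user_id"));
      collector.collect(new Pair<Utf8, GenericRecord>(
          key, Schema.create(Schema.Type.STRING),
          record, record.getSchema()));
    }
  }

  public static void configure(JobConf conf,
                               Schema userSchema, Schema eventSchema) {
    Schema union = Schema.createUnion(
        Lists.newArrayList(userSchema, eventSchema));
    AvroJob.setInputSchema(conf, union);
    // Shuffle (string join key, union-typed record) pairs to the reducers.
    AvroJob.setMapOutputSchema(conf,
        Pair.getPairSchema(Schema.create(Schema.Type.STRING), union));
    AvroJob.setMapperClass(conf, JoinMapper.class);
  }
}

In the reducer, the same schema-name check separates the two sides of the
join.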
On Thu, Apr 25, 2013 at 6:05 PM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:

> I'm afraid I don't have an example -- the code I have is very entangled
> with our internal stuff; it would take a while to extract the
> general-purpose parts.
>
> I do mean <AvroWrapper<GenericRecord>, NullWritable> as input for
> mappers, since those are the types produced by AvroInputFormat:
> http://svn.apache.org/viewvc/avro/trunk/lang/java/mapred/src/main/java/org/apache/avro/mapred/AvroInputFormat.java?view=markup
>
> The reducer input types are just your mapper output types, so you can
> choose those yourself (any Hadoop writables).
>
> Martin
>
>
> On 25 April 2013 08:26, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>
>> Thanks, Martin! Would you happen to have a gist of an example? Did you
>> mean the reducer input is NullWritable?
>>
>> On Apr 25, 2013, at 7:44 AM, Martin Kleppmann <[EMAIL PROTECTED]>
>> wrote:
>>
>> Oh, sorry, you're right. I was too hasty.
>>
>> One approach that I've used for joining Avro inputs is to use regular
>> Hadoop mappers and reducers (instead of AvroMapper/AvroReducer) with
>> MultipleInputs and AvroInputFormat. Your mapper input key type is then
>> AvroWrapper<GenericRecord>, and the mapper input value type is NullWritable.
>> This approach uses Hadoop sequence files (rather than Avro files) between
>> mappers and reducers, so you have to take care of serializing mapper output
>> and deserializing reducer input yourself. It works, but you have to write
>> quite a bit of annoying boilerplate code. (A sketch of this wiring appears
>> at the end of this thread.)
>>
>> I'd also be interested if anyone has a better solution. Perhaps we just
>> need to create the AvroMultipleInputs that I thought existed, but doesn't :)
>>
>> Martin
>>
>>
>> On 24 April 2013 12:02, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>
>>> Hey Martin,
>>>
>>> I think those classes refer to outputting to multiple files rather than
>>> reading from multiple files, which is what's needed for a reduce-side join.
>>>
>>> thanks,
>>> Sripad
>>>
>>>
>>> On Wed, Apr 24, 2013 at 3:35 AM, Martin Kleppmann <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hey Sripad,
>>>>
>>>> Take a look at AvroMultipleInputs.
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapred/AvroMultipleOutputs.html (mapred version)
>>>>
>>>> http://avro.apache.org/docs/1.7.4/api/java/org/apache/avro/mapreduce/AvroMultipleOutputs.html (mapreduce version)
>>>>
>>>> Martin
>>>>
>>>>
>>>> On 23 April 2013 17:01, Sripad Sriram <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> I'm aware that I can use Pig, Hive, etc. to join Avro files together,
>>>>> but I have several use cases where I need to perform a reduce-side join
>>>>> on two Avro files. MultipleInputs doesn't seem to like AvroInputFormat -
>>>>> any thoughts?
>>>>>
>>>>> thanks!
>>>>> Sripad
>>>>>
>>>>
>>>>
>>>
>>
>
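
A minimal sketch of the MultipleInputs + AvroInputFormat wiring Martin
describes above, using the old mapred API; the input paths, the field names
("id", "user_id"), and the class names are all hypothetical:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.mapred.AvroInputFormat;
import org.apache.avro.mapred.AvroWrapper;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class MultiInputJoin {

  // Shared mapper logic: AvroInputFormat delivers each record as an
  // AvroWrapper<GenericRecord> key with a NullWritable value; we emit
  // (join key, hand-serialized record bytes) for the shuffle.
  public abstract static class TaggingMapper extends MapReduceBase
      implements Mapper<AvroWrapper<GenericRecord>, NullWritable,
                        Text, BytesWritable> {

    protected abstract String joinKey(GenericRecord record);

    @Override
    public void map(AvroWrapper<GenericRecord> wrapper, NullWritable ignored,
                    OutputCollector<Text, BytesWritable> out,
                    Reporter reporter) throws IOException {
      GenericRecord record = wrapper.datum();
      // The boilerplate Martin mentions: serialize the record to Avro
      // binary so it can travel through the shuffle as a plain Writable.
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(bytes, null);
      new GenericDatumWriter<GenericRecord>(record.getSchema())
          .write(record, encoder);
      encoder.flush();
      out.collect(new Text(joinKey(record)),
                  new BytesWritable(bytes.toByteArray()));
    }
  }

  public static class UserMapper extends TaggingMapper {
    protected String joinKey(GenericRecord r) {
      return r.get("id").toString();
    }
  }

  public static class EventMapper extends TaggingMapper {
    protected String joinKey(GenericRecord r) {
      return r.get("user_id").toString();
    }
  }

  public static void configure(JobConf conf) {
    // Each input path gets its own mapper; both read through
    // AvroInputFormat.
    MultipleInputs.addInputPath(conf, new Path("/data/users"),
        AvroInputFormat.class, UserMapper.class);
    MultipleInputs.addInputPath(conf, new Path("/data/events"),
        AvroInputFormat.class, EventMapper.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(BytesWritable.class);
  }
}

On the reduce side, each BytesWritable is decoded with a GenericDatumReader
and DecoderFactory.get().binaryDecoder(...). Since records of two different
schemas arrive at the same reducer, the mapper output also needs to carry
something (for example a tag byte prefix) that tells the reducer which
schema to decode each value with.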