Avro >> mail # user >> Reduce-side joins in Avro M/R


Andrew Kenworthy 2011-12-06, 16:43
Scott Carey 2011-12-07, 18:40
Andrew Kenworthy 2011-12-13, 16:46
Re: Reduce-side joins in Avro M/R
The overhead of checking the union is not that high, but it would be useful
to be able to specify a map of different Avro schemas to source paths for a
variety of use cases.  I am not sure to what extent that is possible with
the current Avro mapreduce API.
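
To make the idea concrete, here is a minimal sketch of a reduce-side join in which each source path gets its own mapper and the reducer dispatches on a tag, much as it would on a union branch. It is plain Python that only simulates the map, shuffle, and reduce phases; it does not use the Avro or Hadoop APIs. The paths, field names (`user_id`, `name`, `item`), and function names are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical input paths and record shapes, for illustration only.
USERS_PATH = "/data/users"
ORDERS_PATH = "/data/orders"


def map_user(record):
    # Emit (join key, tagged value); the tag plays the role of the union branch.
    return record["user_id"], ("user", record)


def map_order(record):
    return record["user_id"], ("order", record)


# One mapper per source path, similar in spirit to Hadoop's MultipleInputs.
MAPPERS = {USERS_PATH: map_user, ORDERS_PATH: map_order}


def shuffle(inputs):
    """Group tagged map output by join key, as the shuffle phase would."""
    groups = defaultdict(list)
    for path, records in inputs.items():
        mapper = MAPPERS[path]
        for record in records:
            key, tagged = mapper(record)
            groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Inner join: pair every user with every order sharing the key."""
    users = [v for tag, v in tagged_values if tag == "user"]
    orders = [v for tag, v in tagged_values if tag == "order"]
    return [
        {"user_id": key, "name": u["name"], "item": o["item"]}
        for u in users
        for o in orders
    ]


def run_join(inputs):
    """Run the simulated map, shuffle, and reduce phases over all inputs."""
    joined = []
    for key, tagged_values in sorted(shuffle(inputs).items()):
        joined.extend(reduce_join(key, tagged_values))
    return joined
```

Because each mapper already knows which input it reads, the only per-record dispatch left is the tag check in the reducer. In a real Avro job that tag would correspond to the branch of the union used for the map output value.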

There are some folks working on improved Avro mapreduce/mapred APIs
with the intention of eventually contributing them back to Avro.  You might
get some good ideas from there:
https://issues.apache.org/jira/browse/AVRO-593
https://github.com/wibidata/odiago-avro
On 12/13/11 8:46 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:

> I'm currently using a UNION-schema to map two different types of data (read
> from two different input paths) in my reducer to a common record. This works
> fine, but - if I have understood the mechanism correctly - it would mean that
> Avro is having to check each and every record against my UNION schema. With a
> "normal" reduce-side join, I could use MultipleInputs to specify a mapper for
> each input, thus letting them run independently (since each mapper knows its
> input) with presumably less overhead.
>
> Is it possible with Avro to avoid the overhead of checking each input row
> against the union schema?
>
> Thanks,
>
> Andrew
>
>>
>>   From: Scott Carey <[EMAIL PROTECTED]>
>>  To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Andrew Kenworthy
>> <[EMAIL PROTECTED]>
>>  Sent: Wednesday, December 7, 2011 7:40 PM
>>  Subject: Re: Reduce-side joins in Avro M/R
>>  
>> This should be conceptually the same as a normal map-reduce join of the same
>> type.  Avro handles the serialization, but not the map-reduce algorithm or
>> strategy.  
>>
>> On 12/6/11 8:43 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I'd like to use reduce-side joins in an avro M/R job, and am not sure how to
>>> do it: are there any best-practice tips or outlines of what one would have
>>> to implement in order to make this possible?
>>>
>>> Thanks,
>>>
>>> Andrew Kenworthy