Re: Reduce-side joins in Avro M/R
Scott Carey 2012-01-05, 23:20
The overhead of checking the union is not that high, but it would be useful
to be able to specify a map of different Avro schemas to source paths for a
variety of use cases. I am not sure to what extent that is possible with
the current Avro mapreduce API.
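
As a rough illustration of the per-path idea, here is a minimal sketch of a
reduce-side join in which each input path has its own mapper, so every mapper
already knows the shape of its records. This is plain Python, not the Avro or
Hadoop API; the paths, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical per-source mappers. Each one knows its own record shape and
# tags its output with the source name so the reducer can tell them apart.
def map_user(record):
    return record["user_id"], ("user", record)

def map_order(record):
    return record["user_id"], ("order", record)

# Hypothetical mapping of input path to mapper, in the spirit of MultipleInputs.
MAPPERS_BY_PATH = {
    "/data/users": map_user,
    "/data/orders": map_order,
}

def run_map_phase(inputs):
    """Apply the mapper registered for each path and group output by join key."""
    shuffled = defaultdict(list)
    for path, records in inputs.items():
        mapper = MAPPERS_BY_PATH[path]
        for record in records:
            key, tagged = mapper(record)
            shuffled[key].append(tagged)
    return shuffled

def reduce_join(key, tagged_values):
    """Inner join: pair every user record with every order record for this key."""
    users = [value for tag, value in tagged_values if tag == "user"]
    orders = [value for tag, value in tagged_values if tag == "order"]
    return [
        {"user_id": key, "name": user["name"], "item": order["item"]}
        for user in users
        for order in orders
    ]

def run_join(inputs):
    """Simulate map, shuffle, and reduce in memory."""
    shuffled = run_map_phase(inputs)
    joined = []
    for key in sorted(shuffled):
        joined.extend(reduce_join(key, shuffled[key]))
    return joined
```

Calling run_join with a dict of path to record list returns the joined rows
for keys present in both inputs. In a real job the tagged value would still
have to be serialized between map and reduce, which is where a union (or a
wrapper record) comes in.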
There are some folks working on improved Avro mapreduce/mapred APIs
with the intention of eventually contributing them back to Avro. You might
get some good ideas from there:
On 12/13/11 8:46 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:
> I'm currently using a UNION-schema to map two different types of data (read
> from two different input paths) in my reducer to a common record. This works
> fine, but - if I have understood the mechanism correctly - it would mean that
> Avro is having to check each and every record against my UNION schema. With a
> "normal" reduce-side join, I could use MultipleInputs to specify a mapper for
> each input, thus letting them run independently (since each mapper knows its
> input) with presumably less overhead.
> Is it possible with Avro to avoid the overhead of checking each input row
> against the union schema?
>> From: Scott Carey <[EMAIL PROTECTED]>
>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Andrew Kenworthy
>> <[EMAIL PROTECTED]>
>> Sent: Wednesday, December 7, 2011 7:40 PM
>> Subject: Re: Reduce-side joins in Avro M/R
>> This should be conceptually the same as a normal map-reduce join of the same
>> type. Avro handles the serialization, but not the map-reduce algorithm or
>> On 12/6/11 8:43 AM, "Andrew Kenworthy" <[EMAIL PROTECTED]> wrote:
>>> I'd like to use reduce-side joins in an avro M/R job, and am not sure how to
>>> do it: are there any best-practice tips or outlines of what one would have
>>> to implement in order to make this possible?
>>> Andrew Kenworthy