MapReduce, mail # user - Chaining Multiple Reducers: Reduce -> Reduce -> Reduce

Re: Chaining Multiple Reducers: Reduce -> Reduce -> Reduce
Jim Twensky 2012-10-05, 18:02
Hi Harsh,

The hidden map operation which is applied to the reduced partition at
one stage can generate keys that are outside of the range covered by
that particular reducer. I still need to have the many-to-many
communication from reduce step k to reduce step k+1. Otherwise, I
think the ChainReducer would do the job and apply multiple maps to
each isolated partition produced by the reducer.

Jim
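
To illustrate the point above about keys leaving their partition, here is a
minimal Python sketch. It is not Hadoop code: the integer node IDs, the
`key % num_reducers` partitioner, and the `hidden_map` function that re-keys
each record by neighbor are all assumptions made for the example.

```python
def partition(key, num_reducers):
    # Stand-in partitioner for integer keys (an assumption, not Hadoop's own).
    return key % num_reducers


def hidden_map(node, neighbors):
    # Hypothetical "hidden map": re-key each reduce output by neighbor ID.
    return [(nbr, node) for nbr in neighbors]


def escaped_keys(reduce_output, num_reducers):
    """List (source_node, new_key) pairs whose new key belongs to a
    different partition than the reducer that produced it."""
    escaped = []
    for node, neighbors in reduce_output.items():
        home = partition(node, num_reducers)
        for new_key, _value in hidden_map(node, neighbors):
            if partition(new_key, num_reducers) != home:
                escaped.append((node, new_key))
    return escaped


if __name__ == "__main__":
    # Node 0 lives in partition 0; its neighbor 1 hashes to partition 1.
    print(escaped_keys({0: [1, 2]}, num_reducers=2))
```

Any pair this function reports has to travel to another reducer, which is
why a per-partition re-sort alone would not be enough under these
assumptions.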

On Fri, Oct 5, 2012 at 12:54 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Would it then be right to assume that the keys produced by the reduced
> partition at one stage would be isolated to its partition alone and
> not occur in any of the other partition outputs? I'm guessing not,
> based on the nature of your data?
>
> I'm trying to understand why it would be good to avoid shuffling here,
> and whether it can be avoided in some way, given the data. As I see it,
> you need to re-sort based on the new key per partition, but not to
> shuffle? Or am I wrong?
>
> On Fri, Oct 5, 2012 at 11:13 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>> Hi Harsh,
>>
>> Yes, there is actually a "hidden" map stage that generates new
>> <key,value> pairs based on the last reduce output, but I can create
>> those records during the reduce step instead and get rid of the
>> intermediate map computation completely. The idea is to apply the map
>> function to each output of the reduce inside the reduce class and emit
>> the result as the output of the reducer.
>>
>> Jim
>>
>> On Fri, Oct 5, 2012 at 12:18 PM, Harsh J <[EMAIL PROTECTED]> wrote:
>>> Hey Jim,
>>>
>>> Are you looking to re-sort or re-partition your data by a different
>>> key or key combo after each output from reduce?
>>>
>>> On Fri, Oct 5, 2012 at 10:01 PM, Jim Twensky <[EMAIL PROTECTED]> wrote:
>>>> Hi,
>>>>
>>>> I have a complex Hadoop job that iterates over large graph data
>>>> multiple times until some convergence condition is met. I know that
>>>> the map output goes to the local disk of each particular mapper first,
>>>> and is then fetched by the reducers before the reduce tasks start. I
>>>> can see that this is an overhead, and in theory we can ship the data
>>>> directly from mappers to reducers, without serializing on the local
>>>> disk first. I understand that this step is necessary for fault
>>>> tolerance and it is an essential building block of MapReduce.
>>>>
>>>> In my application, the map process consists of identity mappers which
>>>> read the input from HDFS and ship it to reducers. Essentially, what I
>>>> am doing is applying chains of reduce jobs until the algorithm
>>>> converges. My question is, can I bypass the serialization of the local
>>>> data and ship it from mappers to reducers immediately (as soon as I
>>>> call context.write() in my mapper class)? If not, are there any other
>>>> MR platforms that can do this? I've been searching around and couldn't
>>>> see anything similar to what I need. Hadoop On Line is a prototype and
>>>> has some similar functionality but it hasn't been updated for a while.
>>>>
>>>> Note: I know about ChainMapper and ChainReducer classes but I don't
>>>> want to chain multiple mappers in the same local node. I want to chain
>>>> multiple reduce functions globally so the data flow looks like: Map ->
>>>> Reduce -> Reduce -> Reduce, which means each reduce operation is
>>>> followed by a shuffle and sort essentially bypassing the map
>>>> operation.
>>>
>>>
>>>
>>> --
>>> Harsh J
>
>
>
> --
> Harsh J
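
The Map -> Reduce -> Reduce -> Reduce flow discussed in this thread, with
the "hidden" map folded into the reducer and a shuffle between rounds, can
be simulated in a few lines of Python. This is only a sketch under stated
assumptions, not Hadoop code and not the poster's actual algorithm: the
graph job is replaced by a simple minimum-label propagation, node IDs are
integers, the partitioner is `key % num_reducers`, and all function names
are hypothetical.

```python
from collections import defaultdict


def shuffle(records, num_reducers):
    """Many-to-many step: route each record to a partition by key,
    then group by key and sort the groups within each partition."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in records:
        partitions[key % num_reducers][key].append(value)
    return [sorted(p.items()) for p in partitions]


def reduce_with_folded_map(node, values):
    """Reduce one node, then apply the 'hidden map' inline by emitting
    records that are already keyed for the next round."""
    neighbors, label = [], node
    for kind, payload in values:
        if kind == "adj":
            neighbors = payload
        else:  # kind == "label"
            label = min(label, payload)
    yield node, ("adj", neighbors)    # carry the graph structure forward
    yield node, ("label", label)
    for nbr in neighbors:
        yield nbr, ("label", label)   # may land in another partition


def labels_of(records):
    """Smallest label addressed to each node in a round's output."""
    best = {}
    for node, (kind, payload) in records:
        if kind == "label":
            best[node] = min(payload, best.get(node, payload))
    return best


def run_chain(adjacency, num_reducers=2, max_rounds=20):
    """Repeat shuffle + reduce until the labels stop changing."""
    records = [(n, ("adj", nbrs)) for n, nbrs in adjacency.items()]
    previous = None
    for round_no in range(1, max_rounds + 1):
        out = []
        for part in shuffle(records, num_reducers):
            for node, values in part:
                out.extend(reduce_with_folded_map(node, values))
        current = labels_of(out)
        if current == previous:       # convergence condition
            return current, round_no
        previous, records = current, out
    return previous, max_rounds


if __name__ == "__main__":
    graph = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
    print(run_chain(graph))
```

In this simulation every round pays for a full shuffle, because the records
emitted for neighbors can belong to any partition. In Hadoop itself, each
round would typically be a separate job with an identity mapper, which is
the overhead the original question asks about.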