MapReduce, mail # user - How does a ReduceTask determine which MapTask output to read?


Re: How does a ReduceTask determine which MapTask output to read?
David Rosenstrauch 2011-06-29, 22:37
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
> Hi,
>
> I was wondering what scheduling algorithm Hadoop (version 0.20.2 in
> particular) uses for a ReduceTask to determine the order in which it
> reads the map outputs from the various mappers that have been run.
> In particular, suppose we have 10 maps called map1, map2, ..., map10
> and, say, 2 reducers r1 and r2. Which map's output does r1/r2 read
> from first?
>
> Also, suppose that mapred.reduce.parallel.copies is set to 5. Then
> do both r1 and r2 read from 5 map outputs concurrently?
>
> Thanks,
> Virajith

You're missing 2 key steps here.  Between the map and the reduce, each
mapper's output gets partitioned (each record is assigned to a reducer
based on its key, which spreads the records across the reducers) and
sorted (so the records within each partition are in key order).

So your question is really a moot one.  The records output by a given
map task get spread across multiple reducers rather than all being sent
to a single reducer, so each reducer reads its own slice of every map's
output.
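
To make the partitioning concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The partitionFor method uses the same formula as Hadoop's default HashPartitioner; the class name, the helper names and the sample String keys are made up for illustration, and a real job would hash Writable keys such as Text, whose hash codes differ from String's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Hadoop.
class PartitionSketch {

    // Same formula as Hadoop's default HashPartitioner: mask off the sign
    // bit so the result is non-negative, then take it modulo the number
    // of reduce tasks.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Groups the keys emitted by ONE map task by target reducer and sorts
    // each group, mimicking the partitioned, sorted map output.
    static Map<Integer, List<String>> partition(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int p = partitionFor(key, numReduceTasks);
            byReducer.computeIfAbsent(p, unused -> new ArrayList<>()).add(key);
        }
        for (List<String> group : byReducer.values()) {
            Collections.sort(group);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Keys emitted by a single (made-up) map task, with 2 reducers.
        List<String> map1Keys = Arrays.asList("pear", "apple", "fig", "kiwi", "plum");
        Map<Integer, List<String>> out = partition(map1Keys, 2);
        for (Map.Entry<Integer, List<String>> e : out.entrySet()) {
            System.out.println("reducer " + e.getKey() + " <- " + e.getValue());
        }
    }
}
```

Which reducer a record goes to depends only on its key and the number of reducers, not on which map produced it, so the same key always lands on the same reducer no matter which of the 10 maps emitted it.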

DR