Re: How does a ReduceTask determine which MapTask output to read?
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
> I was wondering what scheduling algorithm is used in Hadoop (version
> 0.20.2 in particular), for a ReduceTask to determine in what order it is
> supposed to read the map outputs from the various mappers that have been
> run? In particular, suppose we have 10 maps called map1, map2, ...,
> map10, and say 2 reducers r1 and r2. Which map's output does r1/r2 read
> from first?
> Also, suppose that the mapred.reduce.parallel.copies is set to 5. Then
> do both r1 and r2 read from 5 map outputs concurrently?
You're missing 2 key steps here. After the map function runs, a partition
step assigns each record to a reducer based on its key, and a sort step
orders the records by key within each partition.
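As a rough sketch of the partition step, assuming the job uses the default HashPartitioner (whose arithmetic is reproduced in partitionFor below): the class name PartitionSketch, the helper partitionOutput, and the sample keys are made up for illustration and are not Hadoop code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class PartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner:
    // clear the sign bit of the hash, then take the remainder.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Hypothetical helper: groups the keys emitted by ONE map task
    // by the reducer (partition number) each key is assigned to.
    static Map<Integer, List<String>> partitionOutput(List<String> keys, int numReduceTasks) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partitionFor(key, numReduceTasks);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Made-up keys standing in for the output of map1.
        List<String> map1Keys = Arrays.asList("a", "b", "c", "d");
        // With 2 reducers, prints {0=[b, d], 1=[a, c]}
        System.out.println(partitionOutput(map1Keys, 2));
    }
}
```

With 2 reducers, the keys from this one map task land in both partitions, so each reducer has a slice to fetch from every map's output.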
So your question is really a moot one. The records output by a given
map step get spread across multiple reducers, and not all sent to a