MapReduce >> mail # user >> How does a ReduceTask determine which MapTask output to read?


Re: How does a ReduceTask determine which MapTask output to read?
On 06/29/2011 05:28 PM, Virajith Jalaparti wrote:
> Hi,
>
> I was wondering what scheduling algorithm is used in Hadoop (version
> 0.20.2 in particular), for a ReduceTask to determine in what order it is
> supposed to read the map outputs from the various mappers that have been
> run? In particular, suppose we have 10 maps called map1, map2, ...,
> map10, and say 2 reducers, r1 and r2. Which map's output does r1/r2 read
> from first?
>
> Also, suppose that the mapred.reduce.parallel.copies is set to 5. Then
> do both r1 and r2 read from 5 map outputs concurrently?
>
> Thanks,
> Virajith

You're missing two key steps here.  Each mapper's output is partitioned
by key (to spread the records across the reducers) and then sorted in
key order within each partition before the reducers fetch it.

So your question is really a moot one.  The records output by a given
map task get spread across multiple reducers, and are not all sent to a
single reducer, so every reducer reads its own partition from every
map's output.
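
To make the point concrete, here is a rough sketch in Python (not Hadoop
code) of how one map task's records could be split across reducers.  It
assumes a hash partitioner of the form "non-negative hash of the key,
modulo the number of reducers"; zlib.crc32 is only a stand-in for the
key's real hash function, and partition_for, partition_map_output and
reducer_input are hypothetical names made up for this illustration.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_reducers):
    # Assumption: crc32 stands in for the key's hash function.
    # Mask to a non-negative value, then take it modulo the reducer count.
    return (zlib.crc32(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def partition_map_output(records, num_reducers):
    # Split ONE map task's (key, value) records into per-reducer
    # partitions, each sorted by key.
    partitions = defaultdict(list)
    for key, value in records:
        partitions[partition_for(key, num_reducers)].append((key, value))
    return {p: sorted(recs) for p, recs in partitions.items()}


def reducer_input(all_map_outputs, reducer_id):
    # A reducer collects its own partition from EVERY map output and
    # merges the pieces into one key-ordered stream.
    merged = []
    for map_output in all_map_outputs:
        merged.extend(map_output.get(reducer_id, []))
    return sorted(merged)


if __name__ == "__main__":
    map1 = partition_map_output([("b", 1), ("a", 1), ("c", 1)], 2)
    map2 = partition_map_output([("a", 1), ("d", 1)], 2)
    for r in range(2):
        print("reducer", r, reducer_input([map1, map2], r))
```

In this sketch each reducer ends up with a slice of both map1 and map2,
and all records for a given key land on the same reducer regardless of
which map produced them.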

DR