
Re: how does hdfs determine what node to use?

On Mar 10, 2011, at 10:34 AM, Jeffrey Buell wrote:

> Rita said that she has 2 racks (not 2 nodes).  Rita, how many nodes per rack do you have?
> To continue the thread, could there be a performance advantage to having greater replication in the shuffle or reduce phases?  That is, is Hadoop smart enough that when it needs data that are not on the local node, it finds out which copy of that data is on the closest (in the network sense) node and gets it from there?

The reduce phase doesn't read from HDFS. It does the equivalent of an HTTP GET from the tasktracker that holds the map's intermediate output. The speedup here is that the reduce should get scheduled on a node where one of the job's map tasks ran, especially a host that holds significant map output. This could potentially reduce network usage, but in the end the gain is likely to be insignificant.
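
To put rough numbers on that, here is a minimal sketch of the trade-off, not anything taken from Hadoop's scheduler. The host names, the byte counts, and the functions remote_bytes and best_reduce_host are all made up for illustration. It assumes we somehow know how much of one reducer's partition sits on each host, and compares how much the reducer has to pull over the network depending on where it runs.

```python
def remote_bytes(map_output_by_host, reduce_host):
    """Bytes of map output the reducer must fetch from hosts other than its own."""
    return sum(size for host, size in map_output_by_host.items()
               if host != reduce_host)


def best_reduce_host(map_output_by_host):
    """Host holding the most map output for this reducer (hypothetical policy)."""
    return max(map_output_by_host, key=map_output_by_host.get)


if __name__ == "__main__":
    # Hypothetical sizes (in MB) of one reducer's partition on each host.
    output = {"node1": 400, "node2": 100, "node3": 100, "node4": 100}
    total = sum(output.values())
    host = best_reduce_host(output)
    saved = total - remote_bytes(output, host)
    print(f"run reduce on {host}: {saved} of {total} MB stays local")
```

With map output spread roughly evenly over N hosts, the local share is only about 1/N of the total, which is why the saving tends to be small on anything but a tiny cluster.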