RE: how does hdfs determine what node to use?
Rita said that she has 2 racks (not 2 nodes).  Rita, how many nodes per rack do you have?

To continue the thread: could there be a performance advantage to greater replication in the shuffle or reduce phases?  That is, is Hadoop smart enough that when it needs data that are not on the local node, it finds out which copy of that data is on the closest node (in the network sense) and reads it from there?  Or, if the copies are on the same rack, from the node with the least traffic on it currently?  As opposed to always reading from the node with the "primary" copy.

Jeff
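[As a rough answer to Jeff's first question: when serving a read, HDFS orders the replica locations by network distance from the reader (same node, then same rack, then off-rack) before returning them to the client, rather than picking by current load. A toy sketch of that ordering, assuming a simple two-level rack/node topology with made-up names:

```python
def distance(reader, replica):
    """Tree distance in a two-level (rack, node) topology:
    0 = same node, 2 = same rack, 4 = off-rack."""
    reader_rack, reader_node = reader
    rep_rack, rep_node = replica
    if (rep_rack, rep_node) == (reader_rack, reader_node):
        return 0
    if rep_rack == reader_rack:
        return 2
    return 4

def sort_replicas(reader, replicas):
    """Order replicas closest-first, the way block locations are
    ordered before being handed to a reading client."""
    return sorted(replicas, key=lambda rep: distance(reader, rep))

replicas = [("rack1", "n4"), ("rack0", "n1")]
# A reader on rack0 gets the same-rack copy first.
print(sort_replicas(("rack0", "n2"), replicas))
# → [('rack0', 'n1'), ('rack1', 'n4')]
```

This is only a model of the ordering, not Hadoop's actual code.]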

From: [EMAIL PROTECTED]
Sent: Thursday, March 10, 2011 10:19 AM
To: [EMAIL PROTECTED]
Subject: Re: how does hdfs determine what node to use?

Actually, I just meant to point out that however many copies you have, the copies are placed on different nodes. Although if you only have two nodes, there aren't a whole lot of options... :)

I thought Rita was mainly worried that they would all go to the same node - which would be bad.

Take care,
-stu
________________________________
From: Ayon Sinha <[EMAIL PROTECTED]>
Date: Thu, 10 Mar 2011 07:41:17 -0800 (PST)
To: <[EMAIL PROTECTED]>
ReplyTo: [EMAIL PROTECTED]
Subject: Re: how does hdfs determine what node to use?

I think Stu meant that each block will have a copy on at most 2 nodes.
Before Hadoop 0.20, rack awareness was not built into the algorithm that picks the replica nodes. With 0.20 and later, rack awareness does the following:
1. The first copy of the block is placed "at random" on one of the least-loaded nodes. The next copy is then placed on another node on the same rack (to save network hops).
2. If the replication factor is 3, the third copy goes to a node on another rack. This is done to provide redundancy in case an entire rack becomes unavailable due to a switch failure.

So I am guessing that with a replication factor of 2, both copies will be on the same rack. It's quite possible that Hadoop has some switch somewhere to change this policy, because Hadoop has a switch for everything.

-Ayon
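[The two placement rules Ayon describes can be sketched as a toy model (this is not Hadoop's actual code; the node and rack names are made up):

```python
import random

def place_replicas(nodes_by_rack, rep_factor):
    """Toy model of the placement rules described above: first copy on a
    random node, second copy on the same rack, further copies on other
    racks for switch-failure redundancy."""
    racks = list(nodes_by_rack)
    first_rack = random.choice(racks)
    chosen = [random.choice(nodes_by_rack[first_rack])]
    # Second copy: another node on the same rack, to save network hops.
    same_rack = [n for n in nodes_by_rack[first_rack] if n not in chosen]
    if len(chosen) < rep_factor and same_rack:
        chosen.append(random.choice(same_rack))
    # Further copies: nodes on other racks.
    other = [n for r in racks if r != first_rack for n in nodes_by_rack[r]]
    while len(chosen) < rep_factor and other:
        pick = random.choice(other)
        other.remove(pick)
        chosen.append(pick)
    return chosen

cluster = {"rack0": ["n1", "n2", "n3"], "rack1": ["n4", "n5", "n6"]}
print(place_replicas(cluster, 3))
```

Note that with `rep_factor=2` this model never leaves the first rack, which matches Ayon's guess that both copies end up on the same rack.]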

________________________________
From: Rita <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Thu, March 10, 2011 5:37:08 AM
Subject: Re: how does hdfs determine what node to use?

Thanks Stu. I too was sure there was an algorithm. Is there a place where I can read more about it?  I want to know whether it picks a node according to load average or always picks "rack0" first.
On Wed, Mar 9, 2011 at 10:24 PM, <[EMAIL PROTECTED]> wrote:
There is an algorithm. Each block should have its copies on different nodes. In your case, each block will have a copy on each of the nodes.

Take care,
-stu
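[For what it's worth, one way to see where the copies of a file actually landed is `hadoop fsck`, which prints each block of a file and the datanodes holding its replicas (the path below is a placeholder):

```shell
# Show every block of a file and the datanodes holding its replicas.
hadoop fsck /path/to/file -files -blocks -locations
```
]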
________________________________
From: Rita <[EMAIL PROTECTED]>
Date: Wed, 9 Mar 2011 22:07:37 -0500
To: <[EMAIL PROTECTED]>
ReplyTo: [EMAIL PROTECTED]
Subject: how does hdfs determine what node to use?

I have a 2-rack cluster. All of my files have a replication factor of 2. How does HDFS determine what node to use when serving the data? Does it always use the first rack, or is there an algorithm for this?
--
--- Get your facts first, then you can distort them as you please.--

