-
Questions about HDFS’s placement policy
Giovanni Marzulli 2012-03-14, 16:24
Hello,
I'm trying HDFS on a small test cluster and I need to clarify some doubts about hadoop behaviour.
Some details of my cluster:
Hadoop version: 0.20.2
I have two racks (rack1, rack2). Three datanodes for every rack.
Replication factor is set to 3.

"HDFS’s placement policy is to put one replica on one node in the local rack, another on a node in a different (remote) rack, and the last on a different node in the same remote rack."
Instead, I noticed that sometimes a few blocks of a file are stored as follows: two replicas in the local rack and one replica in a different rack. Are there exceptions that cause behaviour different from the default placement policy?

Likewise, at times some blocks are read from nodes in the remote rack instead of nodes in the local rack. Why does this happen?

Another thing: if I have two datacenters and two racks in each of them (so a hierarchical network topology), where are the two remote replicas stored? Does Hadoop consider the hierarchy and store one replica in the local datacenter and two replicas in the other datacenter? Or are the two replicas stored in a totally random rack?
Thanks Gianni
-
Re: Questions about HDFS’s placement policy
Suresh Srinivas 2012-03-14, 23:14
See my comments inline:
On Wed, Mar 14, 2012 at 9:24 AM, Giovanni Marzulli < [EMAIL PROTECTED]> wrote:
> Hello,
>
> I'm trying HDFS on a small test cluster and I need to clarify some doubts
> about hadoop behaviour.
>
> Some details of my cluster:
> Hadoop version: 0.20.2
> I have two racks (rack1, rack2). Three datanodes for every rack.
> Replication factor is set to 3.
>
> "HDFS’s placement policy is to put one replica on one node in the local
> rack, another on a node in a different (remote) rack, and the last on a
> different node in the same remote rack."
> Instead, I noticed that sometimes, a few blocks of files are stored as
> follows: two replicas in the local rack and a replica in a different rack. Are
> there exceptions that cause different behaviour than default placement
> policy?
Your description of replica placement is correct. However a node chosen based on this placement may not be a good target, due to the traffic on the node, remaining space etc. See BlockPlacementPolicyDefault#isGoodTarget(). Given the small cluster size, you may be seeing different behavior based on load of individual nodes, racks etc.
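The fallback described above can be illustrated with a small simulation. This is a simplified sketch, not Hadoop's code: choose_targets, rack_of and is_good are hypothetical names, and is_good merely stands in for the load and free-space checks that isGoodTarget() is said to perform. The six-node layout mirrors the cluster from the original question.

```python
import random


def choose_targets(rack_of, writer, is_good, rng=random):
    """Pick up to three replica targets for one block (simplified model).

    rack_of: dict mapping node name -> rack name (hypothetical test data).
    is_good: stand-in for the load and free-space checks; returns a bool.
    """
    chosen = []
    everyone = set(rack_of)

    def members(rack):
        return {n for n in everyone if rack_of[n] == rack}

    def pick(pool):
        # Sort so that a seeded rng gives reproducible picks.
        candidates = sorted(n for n in pool if n not in chosen and is_good(n))
        if candidates:
            chosen.append(rng.choice(candidates))
        return bool(candidates)

    # Replica 1: the writer itself, else its rack, else any node.
    if not (pick({writer}) or pick(members(rack_of[writer])) or pick(everyone)):
        return chosen
    # Replica 2: a rack other than replica 1's, else any node.
    first_rack = members(rack_of[chosen[0]])
    if not (pick(everyone - first_rack) or pick(everyone)):
        return chosen
    # Replica 3: same rack as replica 2, else any node.
    if not pick(members(rack_of[chosen[1]])):
        pick(everyone)
    return chosen


RACK_OF = {"n1": "rack1", "n2": "rack1", "n3": "rack1",
           "n4": "rack2", "n5": "rack2", "n6": "rack2"}

# All nodes acceptable: the racks come out as rack1, rack2, rack2.
healthy = choose_targets(RACK_OF, "n1", lambda n: True)
print([RACK_OF[n] for n in healthy])

# n5 and n6 rejected (say, too busy): the third replica falls back to rack1.
busy = {"n5", "n6"}
degraded = choose_targets(RACK_OF, "n1", lambda n: n not in busy)
print([RACK_OF[n] for n in degraded])
```

With only three datanodes per rack, rejecting two nodes in the remote rack is enough to produce the "two local, one remote" layout reported in the question.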
> Likewise, at times some blocks are read from nodes in the remote rack
> instead of nodes in the local rack. Why does it happen?

This is surprising. Not sure if the topology is correctly configured.

> Another thing: if I have two datacenters and two racks for each of them
> (so a hierarchical network topology), where two remote replicas are
> stored? Does Hadoop consider the hierarchy and stores one replica in the
> local datacenter and two replicas in the other datacenter? Or the two
> replicas are stored in a totally random rack?

Hadoop clusters are not spread across datacenters.
Regards, Suresh
-
Re: Questions about HDFS's placement policy
Giovanni Marzulli 2012-03-16, 13:29
On 15/03/2012 00:14, Suresh Srinivas wrote:
> Hadoop clusters are not spread across datacenters.

I mentioned datacenters only as an example. Let me rephrase the question. Suppose I have this network topology:

/rackA/rack1
/rackA/rack2

/rackB/rack3
/rackB/rack4

and I write a file from a node in rack2 (under rackA). The first replica will be stored in rack2; where will the other two replicas be stored? In rackA, in rackB, or in a random rack? In other words, what is the placement policy in a hierarchical network topology?
-
Re: Questions about HDFS's placement policy
palmercliff@...) 2012-03-16, 14:23
I recommend that you test your rack identification script, and test it under load. We encountered similar, seemingly random placement of files by HDFS and tracked the cause to this script. I hope this helps.
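For reference, a minimal rack identification script might look like the sketch below. It assumes the usual contract for Hadoop topology scripts: the script receives one or more IP addresses or hostnames as arguments and prints one rack path per argument, separated by whitespace. The address table is hypothetical and would have to be replaced with the real cluster layout; "/default-rack" is the rack Hadoop uses for unmapped nodes.

```python
#!/usr/bin/env python
import sys

# Rack used for nodes that are missing from the table.
DEFAULT_RACK = "/default-rack"

# Hypothetical address-to-rack table for a two-rack, six-node cluster.
RACK_BY_HOST = {
    "10.0.1.11": "/rack1",
    "10.0.1.12": "/rack1",
    "10.0.1.13": "/rack1",
    "10.0.2.11": "/rack2",
    "10.0.2.12": "/rack2",
    "10.0.2.13": "/rack2",
}


def resolve(hosts):
    """Return exactly one rack path per input host, in the same order."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Several hosts may arrive in a single call; answer all of them.
    print(" ".join(resolve(sys.argv[1:])))
```

A quick check is to call the script by hand with several addresses at once, including one that is not in the table, and confirm that the number of racks printed matches the number of arguments. A script that answers only for the first argument, or that fails when called many times in quick succession, could leave nodes on the wrong rack.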
Sent from the desk of an overwhelmed engineer
-
Re: Questions about HDFS's placement policy
Joey Echeverria 2012-03-16, 16:05
The current replica placement policy is not aware of multiple levels in your topology. So, in your example, it would pick any of the other three racks: /rackA/rack1, /rackB/rack3, or /rackB/rack4 with equal probability.
The only way to get the behavior you desire is to specify only one level of racks (/rackA and /rackB).
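The difference between the two layouts can be sketched as follows. This only illustrates the description above and is not Hadoop code: top_level_rack and remote_rack_candidates are hypothetical helpers, and each full location string is treated as one opaque rack name.

```python
def top_level_rack(location):
    """Collapse a multi-level location such as '/rackA/rack2' to '/rackA'."""
    parts = [p for p in location.split("/") if p]
    if not parts:
        raise ValueError("empty rack location: %r" % (location,))
    return "/" + parts[0]


def remote_rack_candidates(racks, local_rack):
    """Every rack whose name differs from the writer's counts as remote."""
    return sorted(set(racks) - {local_rack})


racks = ["/rackA/rack1", "/rackA/rack2", "/rackB/rack3", "/rackB/rack4"]

# Two-level names: the sibling rack under /rackA is as remote as the others.
print(remote_rack_candidates(racks, "/rackA/rack2"))

# One-level names: the only remote choice is the other top-level group.
flat = [top_level_rack(r) for r in racks]
print(remote_rack_candidates(flat, top_level_rack("/rackA/rack2")))
```

The trade-off of flattening is that the cluster no longer distinguishes rack1 from rack2 inside /rackA, so the "local rack" becomes the whole group.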
-Joey