HDFS's current default replica placement policy doesn't fit the case of two unevenly sized racks very well. Assume the local rack has more nodes, which means more reducer slots and more disk capacity, so more reducer tasks will run within the local rack. The replica placement policy puts 1 replica on the local rack and 2 replicas on the remote rack, which means the data load is doubled on the remote rack even though it has less capacity.
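To put rough numbers on that imbalance, here is a small back-of-the-envelope sketch. It assumes the 20-node and 10-node racks used as an example further down, assumes writers are spread in proportion to node count (two thirds of writes start in rack A), and models the policy simply as "1 replica on the writer's rack, 2 on the other rack". The function name and the write share are illustrative assumptions, not measurements from a real cluster.

```python
from fractions import Fraction


def replicas_per_node(nodes_a, nodes_b, write_share_a):
    """Expected replicas stored per node, per block written, for racks A and B.

    Simplified model of the default policy: one replica stays on the
    writer's rack and two replicas go to the other rack.
    """
    share_a = Fraction(write_share_a)   # fraction of blocks written from rack A
    share_b = 1 - share_a               # fraction of blocks written from rack B
    rack_a_total = share_a * 1 + share_b * 2
    rack_b_total = share_a * 2 + share_b * 1
    return rack_a_total / nodes_a, rack_b_total / nodes_b


# Assumed example: 20 nodes in rack A, 10 in rack B, writes proportional to nodes.
per_a, per_b = replicas_per_node(20, 10, Fraction(2, 3))
print(per_a, per_b, per_b / per_a)  # prints "1/15 1/6 5/2"
```

Under these assumptions each node in the smaller rack stores 2.5 times as many replicas as a node in the larger rack, which is why the smaller rack fills up first.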
The workaround of cheating in the rack-aware script (as described below) may help to resolve the unbalanced data problem, but it introduces the following two issues:
1. data reliability - all 3 replicas of some blocks may fall into the same "real" rack.
2. rack-level data locality - both task scheduling and replica selection for HDFS reads will work from a wrong view of the real rack topology.
See if this is a tradeoff you are willing to make in your case.
Another workaround, although not designed for this case, may be helpful: enable the "NodeGroup" level of locality, which sits between node and rack and is supported after 1.2.0. Nodes under the same NodeGroup can hold only one replica of a block, which was designed to avoid placing duplicate replicas on VMs running on the same host. Specifically in your case, assume you have 20 machines in rack A and 10 machines in rack B: you can put the rack A nodes into two NodeGroups (so each NodeGroup has 10 nodes) and the rack B nodes into one NodeGroup. In this case, the replicas will be distributed in a ratio of 2:1, no matter where the writer is. Hope it helps.
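As a rough illustration of that layout, a topology script could report a /rack/nodegroup path for each host. This is only a sketch: the host names, the group names (ng1, ng2, ng3) and the default path are made up for the example, and it assumes the usual topology-script convention of receiving host names as arguments and printing one path per argument. Enabling NodeGroup awareness itself needs additional Hadoop configuration that is not shown here.

```python
#!/usr/bin/env python3
import sys

# Hypothetical host names: 20 machines in rack A, 10 machines in rack B.
RACK_A = [f"a{i:02d}.example.com" for i in range(1, 21)]
RACK_B = [f"b{i:02d}.example.com" for i in range(1, 11)]

# Path returned for hosts the script does not know about (assumed value).
DEFAULT = "/default-rack/default-nodegroup"


def build_topology():
    """Map each host to a /rack/nodegroup path: two groups in A, one in B."""
    topology = {}
    for index, host in enumerate(RACK_A):
        group = "ng1" if index < 10 else "ng2"   # split rack A into 10 + 10
        topology[host] = f"/rackA/{group}"
    for host in RACK_B:
        topology[host] = "/rackB/ng3"            # all of rack B in one group
    return topology


TOPOLOGY = build_topology()


def resolve(hosts):
    """Return one topology path per requested host, in the same order."""
    return [TOPOLOGY.get(host, DEFAULT) for host in hosts]


if __name__ == "__main__":
    # One path per line, matching the order of the arguments.
    print("\n".join(resolve(sys.argv[1:])))
```

For example, calling the script with a03.example.com and b07.example.com would print /rackA/ng1 and /rackB/ng3 on separate lines.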
----- Original Message -----
From: "Michael Segel" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, October 3, 2013 8:23:58 PM
Subject: Re: rack awarness unexpected behaviour
The rack-aware script is an artificial concept, meaning you can declare which machine is in which rack, and that may or may not reflect where the machine is actually located.
The idea is to balance the number of nodes in the racks, at least on paper. So you can have 14 machines in rack 1 and 16 machines in rack 2, even though physically there may be 20 machines in rack 1 and 10 machines in rack 2.
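A minimal sketch of that "on paper" rebalancing, using the 20/10 physical layout and the 14/16 logical split from the example above. The host names and the choice of which machines get relabeled are arbitrary assumptions made for illustration.

```python
from collections import Counter

# Hypothetical host names; physically 20 machines in rack 1 and 10 in rack 2.
PHYSICAL = {f"r1-n{i:02d}": "/rack1" for i in range(1, 21)}
PHYSICAL.update({f"r2-n{i:02d}": "/rack2" for i in range(1, 11)})


def logical_topology(physical, rack1_size=14):
    """Keep rack1_size physical rack 1 hosts in /rack1; report the rest as /rack2."""
    rack1_hosts = sorted(h for h, rack in physical.items() if rack == "/rack1")
    logical = {host: "/rack2" for host in physical}
    for host in rack1_hosts[:rack1_size]:
        logical[host] = "/rack1"
    return logical


LOGICAL = logical_topology(PHYSICAL)
counts = Counter(LOGICAL.values())

# Hosts whose reported rack differs from the rack they physically sit in.
misreported = sorted(h for h in PHYSICAL if PHYSICAL[h] != LOGICAL[h])
print(counts["/rack1"], counts["/rack2"], len(misreported))  # prints "14 16 6"
```

The six relabeled machines are where the trade-off shows up: replicas placed on "different" logical racks can still end up in the same physical rack.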
On Oct 3, 2013, at 2:52 AM, Marc Sturlese <[EMAIL PROTECTED]> wrote:
> I've checked it out and it works like that. The problem is, if the two racks
> don't have the same capacity, one will have its disk space filled up much
> faster than the other (that's what I'm seeing).
> If one rack (rack A) has 2 servers of 8 cores with 4 reduce slots each and
> the other rack (rack B) has 2 servers of 16 cores with 8 reduce slots each,
> rack A will get filled up faster as rack B is writing more (because it has more
> reduce slots).
> Could a solution be to modify the bash script used to decide to which
> replica a block is written? It would use probability and give rack B double
> the chance to receive the write.
> View this message in context: http://lucene.472066.n3.nabble.com/rack-awareness-unexpected-behaviour-tp4086029p4093270.html
> Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
michael_segel (AT) hotmail.com