On Jul 31, 2011, at 7:30 PM, Saqib Jang -- Margalla Communications wrote:
> Thanks, I'm independently doing some digging into Hadoop networking
> requirements and had a couple of quick follow-ups. Could I have some
> specific info on why different data centers cannot be supported for
> master node and data node comms? Also, what may be the benefits/use
> cases for such a scenario?
Most people who try to put the NN and DNs in different data centers are trying to achieve disaster recovery: one file system in multiple locations. That isn't the way HDFS is designed and it will end in tears. There are multiple problems:
1) no guarantee that one replica of each block will be in each data center (thereby defeating the whole purpose!)
2) assuming one can work out problem 1, during a network break, the NN will lose contact with one half of the DNs, causing a massive network replication storm
3) if one is using MR on top of this HDFS, the shuffle will likely kill the network in between (making MR performance pretty dreadful) and is going to cause delays for the DN heartbeats
4) I don't even want to think about rebalancing.
... and I'm sure a lot of other problems I'm forgetting at the moment. So don't do it.
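To put rough numbers on problems 1 and 2, here is a back-of-the-envelope sketch in Python. It is not how the NN actually computes anything. It assumes the default placement policy as I understand it (first replica on the writer's node, second and third together in one other rack), equal-sized racks, and a uniformly random choice of that other rack. The function names, the 600 TB cluster size, and the 10 Gbit/s aggregate copy rate are all made up for illustration.

```python
from fractions import Fraction


def same_dc_fraction(racks_per_dc, num_dcs=2):
    """Fraction of blocks whose replicas all land in the writer's data center.

    Assumption: the "other rack" for replicas 2 and 3 is picked uniformly
    from every rack except the writer's, with no notion of data centers.
    """
    total_racks = racks_per_dc * num_dcs
    if total_racks < 2:
        raise ValueError("need at least two racks")
    # Of the (total_racks - 1) candidate racks, (racks_per_dc - 1) are local.
    return Fraction(racks_per_dc - 1, total_racks - 1)


def rereplication_tb(raw_tb, lost_fraction):
    """Rough upper bound on data the NN will try to re-create once the
    unreachable DNs are declared dead: every replica that lived on them."""
    return raw_tb * lost_fraction


def transfer_hours(tb, aggregate_gbit_per_s):
    """Hours needed to copy `tb` terabytes at a sustained aggregate rate."""
    bits = tb * 8e12
    return bits / (aggregate_gbit_per_s * 1e9) / 3600


if __name__ == "__main__":
    # Hypothetical cluster: 2 data centers, 10 racks each, 600 TB raw.
    print(same_dc_fraction(10))                    # 9/19
    lost = rereplication_tb(600, 0.5)
    print(lost)                                    # 300.0
    print(round(transfer_hours(lost, 10.0), 1))    # hours at 10 Gbit/s
```

Under those assumptions, nearly half the blocks (9 out of 19) never leave the writer's data center, and a split that hides half the DNs leaves the NN trying to re-create up to 300 TB of replicas.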
If you want disaster recovery, set up two completely separate HDFSes and run everything in parallel.
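One way to read "run everything in parallel" for data loading is to push every incoming file to both clusters yourself, rather than asking HDFS to span them. A minimal sketch, assuming the stock `hadoop fs -put` command is on the path. The NameNode addresses and file paths below are hypothetical placeholders, and the helper names are mine.

```python
import subprocess

# Hypothetical NameNode URIs, one per fully independent HDFS.
NAMENODES = ["hdfs://nn-a.example.com:8020", "hdfs://nn-b.example.com:8020"]


def put_commands(local_path, dest_path, namenodes=NAMENODES):
    """Build one `hadoop fs -put` command line per cluster."""
    return [["hadoop", "fs", "-put", local_path, nn + dest_path]
            for nn in namenodes]


def load_everywhere(local_path, dest_path, namenodes=NAMENODES):
    """Run the uploads one after another; raises if any cluster rejects it."""
    for cmd in put_commands(local_path, dest_path, namenodes):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Only prints the commands; call load_everywhere() to actually run them.
    for cmd in put_commands("/data/in/events.log", "/raw/events.log"):
        print(" ".join(cmd))
```

The same idea applies to jobs: submit each one to both clusters, so that neither file system ever depends on the link between the sites.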