On Jun 9, 2010, at 10:13 PM, Sean Bigdatafun wrote:
> I have two questions about an HDFS cell. Suppose the file I am interested in is stored on 3 datanodes A, B, and C, and A suddenly crashes. I understand that I can still read my file because two copies are still available. But which software module is responsible for bringing A back up? (Is there a watchdog server?)
No, there is no watchdog. Each installation is slightly different, and (almost) every OS provides facilities to keep a daemon running continually [SMF, launchd, daemontools, etc.]. In most installations, I suspect wetware is used to bring back dead datanode processes so that the reason for the crash can be investigated.
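For illustration, the restart-on-exit behaviour those tools provide boils down to something like the loop below. This is only a rough sketch in Python, not how SMF, launchd, or daemontools is actually implemented; `supervise`, `max_restarts`, and `delay` are names I made up, and the command you pass in is whatever foreground command starts the datanode on your installation.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, delay=0.0):
    """Run cmd in the foreground and restart it when it exits abnormally.

    Hypothetical helper: gives up after max_restarts restarts and
    returns the list of exit codes observed, one per run.
    """
    exit_codes = []
    for attempt in range(max_restarts + 1):
        proc = subprocess.run(cmd)
        exit_codes.append(proc.returncode)
        if proc.returncode == 0:
            break  # clean shutdown, do not restart
        # Record the failure so a human can still investigate the crash.
        print(f"run {attempt + 1} exited with {proc.returncode}", file=sys.stderr)
        if attempt < max_restarts:
            time.sleep(delay)  # back off before the next restart
    return exit_codes
```

The cap on restarts is deliberate: a datanode that keeps dying usually has a real problem (bad disk, bad config), and blindly restarting it forever hides the cause.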
> Furthermore, if the disk on server A is totally corrupted (disk failure), what should I do to bring my file back to 3 replicas?
Fix the disk on A and restart the datanode process.
When you have more than 3 datanodes, the namenode will automatically re-replicate any under-replicated blocks, provided there is a node qualified to receive the copy. [In other words, if your grid is large enough to have a rack topology, the namenode will not violate that topology just to replicate a block. It is expected that there are enough nodes in enough racks that replication does not cause policy violations.]
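To make the "qualified node" point concrete, here is a toy model in Python. It is not the namenode's code: the function name, the data layout, and the rule that replicas must span at least two racks are simplifying assumptions of mine, standing in for whatever placement policy your grid is configured with.

```python
def pick_target(replicas, live_nodes, rack_of, min_racks=2):
    """Return a node to copy an under-replicated block to, or None.

    replicas:   set of live nodes that still hold the block
    live_nodes: iterable of all live datanodes
    rack_of:    dict mapping node name -> rack name
    min_racks:  assumed policy, replicas must span at least this many racks
    """
    for node in sorted(live_nodes):
        if node in replicas:
            continue  # a node never holds two copies of the same block
        racks = {rack_of[n] for n in replicas | {node}}
        if len(racks) >= min_racks:
            return node  # first node that keeps the placement policy intact
    return None  # nobody qualifies, so the block stays under-replicated


rack_of = {"A": "r1", "B": "r1", "C": "r2", "D": "r2"}

# A has died. With only B and C left there is nowhere to put a third copy.
three_node_grid = pick_target({"B", "C"}, {"B", "C"}, rack_of)

# With a fourth node D available, the block can be re-replicated to it.
four_node_grid = pick_target({"B", "C"}, {"B", "C", "D"}, rack_of)
```

Here `three_node_grid` is `None` and `four_node_grid` is `"D"`, which is why a grid with exactly 3 datanodes has to wait for A to come back, while a larger grid heals itself.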