I have been familiarizing myself with the Hadoop 0.23 code tree, and found myself wondering if people were aware of the tools commonly used by the HPC community as I worked my way thru the code. Just in case the community isn't, I thought it might be worth a very brief summary of the field.
The HPC community has a long history of implementing and fielding resource and application management tools that are fault tolerant with respect to process and node failures. These systems readily scale to the 30K node and 100K process size today, and are running on 10s of thousands of clusters around the world. Efforts to extend capabilities to the 100K node, 1M process level are aggressively being pursued and expected to be fielded in the next two years.
Although these systems support MPI applications, there is nothing exclusively MPI about them. All will readily execute any application, including a MapReduce operation. HDFS support is lacking, of course, but can be readily added in most cases.
My point here is that replicating all the fault tolerant resource and application management capabilities of the existing systems is a major undertaking that may not be necessary. The existing systems represent hundreds of man-years of effort, coupled with thousands of machine-years of operational experience - together, these provide a level of confidence that will be hard to duplicate.
In the HPC world, resource managers and application managers are typically separate entities. Although many people use the AM that comes with a particular RM, almost all RM/AM systems support a wide range of combinations so that users can pick-and-choose the pairing that best fits their needs. Both proprietary (e.g., Moab and LSF) and open-source (e.g., SLURM and Gridengine) versions of RMs and AMs are available, each with differing levels of fault tolerance.
For those not wanting to deal with the peculiarities of each combination, abstraction layers are available. For example, Open MPI's "mpirun" tool will transparently interface to nearly all available RMs and AMs, executing your application in the same manner regardless of the underlying systems. OMPI provides its own fault tolerance capabilities to ensure commonality across environments, and is capable of restarting and rewiring applications as required. In addition, OMPI allows the definition of "fault groups" (i.e., failure dependencies between nodes) to further protect against cascading failures, and a variety of software and hardware sensors to detect deteriorating behavior prior to actual node/process failure.
Multiple error response strategies are available from most existing systems, including OMPI. These are user-selectable at time of execution and range from simple termination of the entire application, to restarting processes on the current node (assuming that node is up or restarts), shifting processes to nodes currently being used by the application, and shifting processes to available "backup" nodes.
I can provide more info as requested, but wanted to at least make you aware of the existing capabilities. A quick "poll" of HPC RM/AM providers indicates a willingness to add HDFS support - they haven't done so to-date because nobody asked them to do so, and they tend to respond to user demand. No technical barrier is immediately apparent.