> I found a way:
> 1) Configure second datanode with a set of fresh empty directories.
> 2) Start second datanode, let it register with namenode.
> 3) Shut down first and second datanode, then move blk* and subdir dirs
> from data dirs of first node to data dirs of second datanode.
> 4) Start first and second datanode.
> This seems to work as intended. However, after some thinking I came to
> worry about the replication. HDFS will now consider the two datanode
> instances on the same host as two different hosts, which may cause
> replication to put two copies of the same file on the same host.
> It's probably not going to happen very often given that there's some
> randomness involved. And in my case there's always a third copy on
> another rack.
> Still, it's less than optimal. Are there any ways to fool HDFS into
> always placing all copies on different physical hosts in this rather
> messed up configuration?
This is the same issue that arises when running multiple virtual machines on each physical host. I've found (on 0.20.2) that multiple VMs per host give consistently better performance than a single VM or a native OS instance (http://www.vmware.com/resources/techresources/10222), at least for I/O-intensive apps. I'm still investigating why, but one possibility is that a single datanode can't efficiently handle too many disks (I have either 10 or 12 per physical host). So I'm very interested in seeing whether multiple datanodes have a similar performance effect to multiple VMs (each with one DN).
Back to replication: Hadoop doesn't know that the machines it's running on might share a physical host, so it is possible for 2 copies to end up on the same host. What I'm doing now is defining each host as a rack in the topology, so the second copy is guaranteed to go to a different host. I have a single physical rack. I'm tempted to call physical racks "super racks" to distinguish them from logical racks. The drawback is that the default policy puts the third copy in the same rack as the second, which here means the same host. A better scheme may be to divide the physical rack into 2 logical racks, so that most of the time the third copy goes on a different host than the second. I think that is the best that can be done today.
Ideally we would modify the block placement algorithm to recognize another level in the topology hierarchy for the multiple VM/DN case. A simpler solution would be to add an option that places the third copy in a third rack when one is available (extended to n replicas on n racks, instead of random placement, for n>3). This would work for the single-physical-rack case with each host defined as a rack in the topology. Placing replicas on separate racks may also be desirable for some conventional configurations (e.g., ones with good inter-rack bandwidth).
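For anyone wanting to try the host-as-rack idea, here is a rough sketch of a topology script, i.e. the executable named by topology.script.file.name in 0.20.x. Hadoop passes it one or more node addresses as arguments and reads back one rack path per argument. The addresses, the host names and the PHYSICAL_HOST table are made up for illustration; a real script would need your own mapping, and I have not tuned this for any particular cluster.

```python
#!/usr/bin/env python3
"""Sketch of a rack-mapping script that reports each physical host as its own rack."""
import sys

# Hypothetical mapping from datanode/VM address to the physical host it runs on.
# Replace with the real addresses of your cluster.
PHYSICAL_HOST = {
    "10.0.0.11": "host01",
    "10.0.0.12": "host01",
    "10.0.0.21": "host02",
    "10.0.0.22": "host02",
}

# Rack reported for addresses that are not in the table above.
DEFAULT_RACK = "/default-rack"


def rack_for(address):
    """Return a rack path named after the physical host of the given address."""
    host = PHYSICAL_HOST.get(address)
    return "/" + host if host else DEFAULT_RACK


def resolve(addresses):
    """Return one rack path per address, in the same order as the input."""
    return [rack_for(a) for a in addresses]


if __name__ == "__main__":
    # Print the rack paths separated by whitespace, one per argument.
    print(" ".join(resolve(sys.argv[1:])))
```

Nodes that share a physical host get the same rack path, so the second copy always lands on a different host. To try the two-logical-racks variant instead, rack_for could return one of two fixed paths depending on which half of the physical rack the host sits in.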