In dealing with some comments Konstantin had about my submitted
CephFileSystem (HADOOP-6253), I discovered a potential bug in my
implementation, and I'm not sure either how to handle it or how much
handling is necessary.
At present (and probably for the foreseeable future) the Ceph
filesystem deals only in IP addresses, not host names; the
CephFileSystem implementation simply takes the IP addresses it is
given and stuffs them into the hostname field. The BlockLocation
format, meanwhile, apparently expects to get actual host names (e.g.,
server.internal-subnet.company.com). As best I can tell from a bit of
googling, these are *mostly* just resolved to IP addresses anyway, but
apparently not always.
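
To make that concrete, here's a simplified sketch of what my
implementation effectively does (the class name, IP handling, and the
6800 port are illustrative stand-ins for what Ceph actually returns):

import org.apache.hadoop.fs.BlockLocation;

public class CephLocationSketch {
  // Simplified sketch of my current behavior: Ceph hands back OSD IP
  // addresses, and I stuff them straight into the hostname slots of
  // BlockLocation. The ":6800" port here is just an illustrative value.
  static BlockLocation fromCephIps(String[] ips, long offset, long len) {
    String[] names = new String[ips.length];   // "host:port" field
    for (int i = 0; i < ips.length; i++) {
      names[i] = ips[i] + ":6800";
    }
    return new BlockLocation(names, ips, offset, len);  // IPs as "hosts"
  }
}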
My understanding is that these BlockLocations are then used to
strategically place computation on the nodes already hosting the
related files, reducing network usage.
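
Based on my reading of the 0.20-era FileInputFormat code, the consumer
side looks roughly like the hedged sketch below (simplified; names are
mine): the hosts array is copied onto each split and, as I understand
it, later string-matched against the hostnames TaskTrackers report,
which is exactly where an IP-vs-hostname mismatch would surface as a
failed locality match.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileSplit;

public class LocalitySketch {
  // Hedged sketch: build a split for the first block of a file, carrying
  // along whatever getFileBlockLocations() reported as its hosts.
  static FileSplit firstSplit(Configuration conf, Path file) throws Exception {
    FileSystem fs = file.getFileSystem(conf);
    FileStatus stat = fs.getFileStatus(file);
    BlockLocation[] locs = fs.getFileBlockLocations(stat, 0, stat.getLen());
    return new FileSplit(file, locs[0].getOffset(), locs[0].getLength(),
                         locs[0].getHosts());  // assumes at least one block
  }
}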
I expect that, in the case of an unconventional network setup (e.g.
http://markmail.org/message/an4dh7va2is3iq3h) this could cause a drop
in performance (as my returned IP addresses might not match the known
IP for a given multi-IP host), but I am curious if returning IP
addresses will cause *all* comparisons to fail, or if there are any
other potential problems I need to be aware of.
If it does require the actual known hostname rather than the IP
address, I'm curious whether there are any suggestions on how to
gather these -- from some archived email messages, I've gathered that
Hadoop actually gets its known hostnames from the config files. Is
that right?
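
The only alternative I can think of, short of config files, is reverse
DNS -- something like the sketch below. This assumes the cluster's DNS
actually has PTR records for the OSD addresses; note that
getCanonicalHostName() quietly falls back to returning the IP string
when the lookup fails.

import java.net.InetAddress;
import java.net.UnknownHostException;

public class ReverseDnsSketch {
  // Reverse-resolve each Ceph-reported IP address to a hostname.
  // Assumes the cluster DNS has PTR records for the OSD addresses; if a
  // lookup fails, getCanonicalHostName() just hands the IP string back.
  static String[] resolveToHostnames(String[] ips) throws UnknownHostException {
    String[] hosts = new String[ips.length];
    for (int i = 0; i < ips.length; i++) {
      hosts[i] = InetAddress.getByName(ips[i]).getCanonicalHostName();
    }
    return hosts;
  }
}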
Also, I'm curious why there is both a "hosts" field (expecting
hostname) and a "name" field (expecting hostname:port). I presume that
HDFS has some internal hacks it uses to start communicating
faster/more efficiently based on the given port -- expecting Hadoop to
communicate directly over a third-party filesystem's port seems a bit
odd.
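
For reference, here is my (hedged) reading of how HDFS itself fills in
the two fields, which is what makes me suspect the port is only
meaningful to DFSClient internally. The hostnames are made up, and
50010 is just the usual datanode data-transfer default:

import org.apache.hadoop.fs.BlockLocation;

public class HdfsFieldSketch {
  // Illustration only: as I understand it, "name" carries host:port for
  // the datanode's data-transfer socket, while "hosts" is the bare
  // hostname used for locality matching.
  static BlockLocation example() {
    String[] names = { "dn1.company.com:50010", "dn2.company.com:50010" };
    String[] hosts = { "dn1.company.com", "dn2.company.com" };
    return new BlockLocation(names, hosts, 0L, 64L * 1024 * 1024);
  }
}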