On Jul 18, 2011, at 12:53 PM, Ben Clay wrote:
> I'd like to spread Hadoop across two physical clusters, one which is
> publicly accessible and the other which is behind a NAT. The NAT'd machines
> will only run TaskTrackers, not HDFS, and not Reducers either (configured
> with 0 Reduce slots). The master node will run in the publicly-available
> cluster.
Off the top, I doubt it will work: MR is bi-directional, across many random ports, so I'd suspect there is going to be a lot of hackiness in the network config to make this work.
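
For concreteness, here's roughly what you'd have to account for, assuming stock 0.20.x defaults (all of these are configurable, so check your own *-site.xml before trusting the port numbers):

    Inbound to each NAT'd TaskTracker:
      50060/tcp  mapred.task.tracker.http.address   # reducers fetch map output
    Outbound from each NAT'd TaskTracker (plain NAT usually handles these):
      JobTracker RPC   mapred.job.tracker           # heartbeats, task assignment
      NameNode RPC     fs.default.name              # job setup, input splits
      50010/tcp        dfs.datanode.address         # streaming job input from DataNodes

On top of that, the child task JVMs talk to their local TaskTracker on an ephemeral loopback port (mapred.task.tracker.report.address), but that one never crosses the NAT.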
> 1. Port 50060 needs to be opened for all NAT'd machines, since Reduce tasks
> fetch intermediate data from http://<tasktracker>:50060/mapOutput,
> correct? I'm getting "Too many fetch-failures" with no open ports, so I
> assume the Reduce tasks need to pull the intermediate data instead of Map
> tasks pushing it.
Correct. Reduce tasks pull.
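
If you want to open just that and nothing else on each NAT'd box, something like the following iptables rule is the minimal version (assuming you've left mapred.task.tracker.http.address at its default of 0.0.0.0:50060):

    # allow reducers to fetch map output from this TaskTracker
    iptables -A INPUT -p tcp --dport 50060 -j ACCEPT

Note that opening the port on the VM isn't enough by itself: the NAT device also has to forward an externally reachable port to each VM's 50060, and since Hadoop advertises the internal address, the reducers won't know about that mapping. That's the kind of hackiness I meant above.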
> 2. Although the NAT'd machines have unique IPs and reach the outside, the
> DHCP is not assigning them hostnames. Therefore, when they join the
> JobTracker I get
> "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
> machine list page. Is there some way to force Hadoop to refer to them via
> IP instead of hostname, since I don't have control over the DHCP? I could
> manually assign a hostname via /etc/hosts on each NAT'd machine, but these
> are actually VMs and I will have many of them receiving semi-random IPs,
> making this an ugly administrative task.
Short answer: no.
Long answer: no, fix your DHCP and/or do the /etc/hosts hack.
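
If you do go the /etc/hosts route, a boot-time script baked into the VM image keeps it from being a per-VM chore. A rough sketch, assuming a Linux guest where hostname -I works (the vm-a-b-c-d naming is just an example):

    #!/bin/sh
    # Derive a stable name from the leased IP and pin it locally, so the
    # TaskTracker registers as something other than localhost.localdomain.
    IP=$(hostname -I | awk '{print $1}')
    NAME="vm-$(echo "$IP" | tr . -)"
    hostname "$NAME"
    echo "$IP $NAME" >> /etc/hosts

You'd likely want the same entries on whatever nodes resolve the TaskTracker names (the JobTracker, and any node running reducers), which is exactly why fixing the DHCP/DNS is the better answer.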