Re: TaskTrackers behind NAT
Allen Wittenauer 2011-07-19, 00:24

On Jul 18, 2011, at 12:53 PM, Ben Clay wrote:

> I'd like to spread Hadoop across two physical clusters, one which is
> publicly accessible and the other which is behind a NAT. The NAT'd machines
> will run only TaskTrackers: no HDFS, and no Reduce tasks either (they are
> configured with 0 Reduce slots). The master node will run in the publicly
> accessible cluster.

Off the top, I doubt it will work: MR is bi-directional across many random ports, so I suspect there is going to be a lot of hackiness in the network config to make this work.
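
For reference, a minimal sketch of what the 0-reduce-slot setup described above might look like in mapred-site.xml on the NAT'd nodes. Property names are from the Hadoop 0.20/1.x line current at the time; "jt.example.com" is a placeholder for the public master:

    <!-- mapred-site.xml on a NAT'd worker (sketch): point at the public
         JobTracker and run map slots only -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>jt.example.com:9001</value>  <!-- placeholder hostname -->
      </property>
      <property>
        <!-- 0 reduce slots, so no reduce tasks are ever scheduled here -->
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>0</value>
      </property>
    </configuration>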

> 1. Port 50060 needs to be opened for all NAT'd machines, since Reduce tasks
> fetch intermediate data from http://<tasktracker>:50060/mapOutput, correct?
> I'm getting "Too many fetch-failures" with no open ports, so I assume the
> Reduce tasks need to pull the intermediate data instead of Map tasks
> pushing it.

Correct. Reduce tasks pull.
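
For context, that fetch goes to the TaskTracker's embedded HTTP server, whose bind address and port are themselves configurable. A sketch, assuming the 0.20/1.x property name; 50060 is the default port, and it is this port that would have to be reachable through the NAT from every node that can run a reducer:

    <!-- mapred-site.xml: the HTTP endpoint reducers pull map output from;
         the port (50060 by default) must be reachable from reducer nodes -->
    <property>
      <name>mapred.task.tracker.http.address</name>
      <value>0.0.0.0:50060</value>
    </property>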

> 2. Although the NAT'd machines have unique IPs and can reach the outside,
> the DHCP server is not assigning them hostnames. Therefore, when they join
> the JobTracker I get
> "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
> machine list page. Is there some way to force Hadoop to refer to them by IP
> instead of hostname, since I don't have control over the DHCP server? I
> could manually assign a hostname via /etc/hosts on each NAT'd machine, but
> these are actually VMs and I will have many of them receiving semi-random
> IPs, making this an ugly administrative task.
Short answer: no.

Long answer: no, fix your DHCP and/or do the /etc/hosts hack.
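
The /etc/hosts hack is a one-line mapping per VM (e.g. "10.0.0.17 tt-vm-17"), which a VM's boot script could write out. Hadoop releases of that era also had a per-node override that reports a fixed name instead of the reverse lookup of 127.0.0.1; a sketch, assuming the 0.20/1.x property name and a placeholder hostname:

    <!-- mapred-site.xml on each NAT'd VM: report this name to the
         JobTracker instead of localhost.localdomain -->
    <property>
      <name>slave.host.name</name>
      <value>tt-vm-17</value>  <!-- placeholder; set per VM at boot -->
    </property>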