Btw - I figured out the problem.
The jobconf from the remote client carried the SOCKS proxy configuration - the JVMs spawned by the TaskTrackers picked this up and tried to connect through the proxy, which of course didn't work.
This was easy to solve - I just had to make the remote initialization script mark hadoop.rpc.socket.factory.class.default as a final property in hadoop-site.xml on the server side.
I am assuming this would be a good thing to do in general (I can't see why server-side traffic would ever be routed through a proxy!).
Filed https://issues.apache.org/jira/browse/HADOOP-5839 to follow up on the issues uncovered here.
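As a rough sketch, the server-side hadoop-site.xml entry would look something like the fragment below. The value shown is an assumption: org.apache.hadoop.net.StandardSocketFactory is the stock non-proxy socket factory, and the thread does not say which value the script actually wrote. The <final> element is the part that matters here, since it keeps a submitted job's conf from overriding the setting.

```xml
<!-- Server-side hadoop-site.xml (sketch; the value is an assumed default) -->
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.StandardSocketFactory</value>
  <!-- final: configuration loaded later, such as a job's conf, cannot override this -->
  <final>true</final>
</property>
```

The remote client keeps its own SOCKS settings (typically org.apache.hadoop.net.SocksSocketFactory for the same property, plus hadoop.socks.server) in its local configuration only, so they no longer leak into the task JVMs.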
From: Tom White [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 14, 2009 7:07 AM
To: [EMAIL PROTECTED]
Subject: Re: public IP for datanode on EC2
Yes, you're absolutely right.
On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
> The EC2 documentation points to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe, since they resolve to internal addresses from within the cluster (and to public IP addresses from outside).
> The only data transfer that I would incur while submitting jobs from outside is the cost of copying the jar files and any other files meant for the distributed cache. That would be extremely small.
> -----Original Message-----
> From: Tom White [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, May 14, 2009 5:58 AM
> To: [EMAIL PROTECTED]
> Subject: Re: public IP for datanode on EC2
> Hi Joydeep,
> The problem you are hitting may be because port 50001 isn't open,
> whereas from within the cluster any node may talk to any other node
> (because the security groups are set up to do this).
> However I'm not sure this is a good approach. Configuring Hadoop to
> use public IP addresses everywhere should work, but you have to pay
> for all data transfer between nodes (see http://aws.amazon.com/ec2/,
> "Public and Elastic IP Data Transfer"). This is going to get expensive.
> So to get this to work well, we would have to make changes to Hadoop
> so it was aware of both public and private addresses, and use the
> appropriate one: clients would use the public address, while daemons
> would use the private address. I haven't looked at what it would take
> to do this or how invasive it would be.
> On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
>> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
>> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
>> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
>> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
>> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
>> at org.apache.hadoop.ipc.Client.call(Client.java:699)
>> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
>> at $Proxy1.getProtocolVersion(Unknown Source)
>> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
>> at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
>> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
>> at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
>> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
>> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
>> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
>> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)