The EC2 documentation points to the use of public IP addresses, whereas using public hostnames seems safer, since they resolve to internal addresses from within the cluster (and to public IP addresses from outside).
The only data transfer I would incur while submitting jobs from outside is the cost of copying the jar files (and any other files meant for the distributed cache). That would be extremely small.
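As a quick illustration of that split resolution, here is a minimal sketch (the hostname is taken from the logs further down in this thread); run it once from a cluster node and once from outside and compare the results:

    import java.net.InetAddress;

    public class ResolveCheck {
        public static void main(String[] args) throws Exception {
            // An EC2 public hostname resolves to the instance's private IP when
            // queried from inside EC2, and to its public IP when queried from outside.
            String host = "ec2-75-101-199-45.compute-1.amazonaws.com";
            System.out.println(host + " -> " + InetAddress.getByName(host).getHostAddress());
        }
    }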
From: Tom White [mailto:[EMAIL PROTECTED]]
Sent: Thursday, May 14, 2009 5:58 AM
To: [EMAIL PROTECTED]
Subject: Re: public IP for datanode on EC2
The problem you are hitting may be that port 50001 isn't open to the
outside world, whereas from within the cluster any node may talk to
any other node (because the security groups are set up to allow this).
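A quick way to check that from the submitting machine is a plain socket connect to the namenode port; a sketch only, reusing the hostname from the logs below:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            // Connects from inside the cluster (the security group allows it),
            // but is refused or times out from outside if 50001 hasn't been opened.
            Socket s = new Socket();
            s.connect(new InetSocketAddress("ec2-75-101-199-45.compute-1.amazonaws.com", 50001), 5000);
            System.out.println("port 50001 is reachable");
            s.close();
        }
    }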
However, I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses everywhere should work, but you have to pay
for all data transfer between nodes (see http://aws.amazon.com/ec2/,
"Public and Elastic IP Data Transfer"). This is going to get expensive.
So to get this to work well, we would have to make changes to Hadoop
so that it is aware of both public and private addresses and uses the
appropriate one: clients would use the public address, while daemons
would use the private address. I haven't looked at what it would take
to do this or how invasive it would be.
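(The client-side half of that idea can be approximated with a per-client override, leaving the cluster's own configuration on the private hostname; a rough sketch only, reusing the hostname and port from this thread:)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;

    public class RemoteHdfsClient {
        public static void main(String[] args) throws Exception {
            // Override fs.default.name only on the remote client; the daemons keep
            // using the private hostname from the cluster's hadoop-site.xml.
            Configuration conf = new Configuration();
            conf.set("fs.default.name",
                     "hdfs://ec2-75-101-199-45.compute-1.amazonaws.com:50001");
            FileSystem fs = FileSystem.get(conf);
            System.out.println("Connected to " + fs.getUri());
        }
    }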
On Thu, May 14, 2009 at 1:37 PM, Joydeep Sen Sarma <[EMAIL PROTECTED]> wrote:
> I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname).
> Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception:
> 2009-05-14 07:30:34,913 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001. Already tried 9 time(s).
> 2009-05-14 07:30:34,914 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
> java.io.IOException: Call to ec2-75-101-199-45.compute-1.amazonaws.com/10.254.175.132:50001 failed on local exception: Connection refused
> at org.apache.hadoop.ipc.Client.call(Client.java:699)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
> at $Proxy1.getProtocolVersion(Unknown Source)
> at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:319)
> at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:104)
> at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:177)
> at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:74)
> at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1367)
> at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:56)
> at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1379)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:215)
> at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:120)
> at org.apache.hadoop.mapred.Child.main(Child.java:153)
> Strangely enough, job submissions from nodes within the EC2 cluster work just fine. I looked at the job.xml files of jobs submitted locally and remotely and don't see any relevant differences.
> Totally foxed now.
> -----Original Message-----
> From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, May 13, 2009 9:38 PM
> To: [EMAIL PROTECTED]
> Cc: Tom White
> Subject: RE: public IP for datanode on EC2
> Thanks Philip. Very helpful (and great blog post)! This seems to make basic dfs command line operations work just fine.
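(The client-side SOCKS setup from Philip's post boils down to roughly the following on the submitting machine; a sketch, assuming an ssh -D tunnel is already listening on localhost:6666, where 6666 is just an example port:)

    import org.apache.hadoop.conf.Configuration;

    public class SocksClientConf {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Route Hadoop RPC through the local SOCKS proxy opened by
            // something like "ssh -D 6666 <ec2 master>".
            conf.set("hadoop.rpc.socket.factory.class.default",
                     "org.apache.hadoop.net.SocksSocketFactory");
            conf.set("hadoop.socks.server", "localhost:6666");
            // ...then create the FileSystem / JobClient from this conf as usual.
        }
    }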
> However - I am hitting a new error during job submission (running hadoop-0.19.0):
> 2009-05-14 00:15:34,430 ERROR exec.ExecDriver (SessionState.java:printError(279)) - Job Submission failed with exception 'java.net.UnknownHostException(unknown host: domU-12-31-39-00-51-94.compute-1.internal)'
> java.net.UnknownHostException: unknown host: domU-12-31-39-00-51-94.compute-1.internal