In my fix, I no longer call the method Utils.shipToHDFS, which writes to a temporary directory. Instead, I write to the staging directory for the job using the remote FS, in the same manner that Daniel's original code writes the job.jar to the staging directory. All the resources will now be in the staging directory for the job (on the remote FS). This should be much better.

I believe I have this corrected now. Note, though, that for SHIP the documentation seems to assume that the "from" filesystem is always the local filesystem, i.e. SHIP('/foo/bar') means /foo/bar is local in some sense (even if it is actually a directory mounted from some other kind of filesystem).

So just generally, there are two annoying problems that I think are the source of the very valid issues you raised:

1. You can't use java.net.URL with schemes like s3://, hdfs://, etc. Unless a protocol handler is registered for the scheme, constructing the URL throws a MalformedURLException.

2. You can't do new Path(filepath).toUri().toURL(), unless the string filepath already has a scheme at the front of it supported by java.net.URL. For example, if filepath is "file:///foo/bar" this will work, but if filepath is "/foo/bar" you will get an IllegalArgumentException complaining that the URI is not absolute.
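
A minimal sketch of the two failure modes above, using only the JDK. It goes through java.net.URI directly as a stand-in for Hadoop's Path.toUri(), and the class name, method name and example locations (namenode, bucket) are made up for illustration. It also assumes a plain JVM where nothing has registered a URL handler for hdfs or s3.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical demo class; not part of the patch.
class UrlSchemeSketch {

    /** Describes what happens when a location string is converted to a java.net.URL. */
    static String tryToUrl(String location) {
        try {
            // Stand-in for new Path(location).toUri().toURL()
            URL url = URI.create(location).toURL();
            return "ok: " + url.getProtocol();
        } catch (MalformedURLException e) {
            // Problem 1: no protocol handler is known for the scheme
            return "MalformedURLException";
        } catch (IllegalArgumentException e) {
            // Problem 2: the URI has no scheme, so it is not absolute
            return "IllegalArgumentException";
        }
    }

    public static void main(String[] args) {
        String[] locations = {
            "hdfs://namenode/foo/bar",
            "s3://bucket/foo/bar",
            "file:///foo/bar",
            "/foo/bar"
        };
        for (String location : locations) {
            System.out.println(location + " -> " + tryToUrl(location));
        }
    }
}
```

Under those assumptions, only the file:/// location converts cleanly; the hdfs:// and s3:// ones hit the first problem and the bare /foo/bar hits the second.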

I think there are basically two ways to get around these problems:

1. Instead of java.net.URLs, identify resources with YARN URLs (which can have any scheme) and do appropriate conversions everywhere. This is the approach I took for my original implementation of STREAM. I identified SHIP resources with YARN URLs with a file:// scheme and CACHE resources already present in the remote FS with an hdfs:// scheme. Rohini, as you pointed out, maybe this is not such a good way.

2. Change the resources data structure from Map<URL, Path> to Map<String, Path>, where the string is the resource name. In the end, this is basically what we give to the AM anyway. The resource names become symlinks in the container's working directory to the file pointed to by the Path (in the remote FS).

In this second approach, if you SHIP('/foo/bar'), I first copy bar into the staging directory for the job in the remote FS, and then map "bar" -> the path in the remote FS. If you CACHE('/foo/bar#fragment'), I just map "fragment" -> /foo/bar, which should already exist in the remote FS.
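
Roughly, the name-to-path mapping described above could look like the sketch below. This is not the code in the patch: the class and method names are hypothetical, plain strings stand in for Hadoop Path objects, the staging directory is a made-up value passed in by the caller, and the actual copy to the remote FS is left out. Falling back to the file name when a CACHE entry has no #fragment is also just an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a "resource name -> remote path" map.
class ResourceMapSketch {
    private final String stagingDir;
    // Resource name (symlink name in the container) -> path in the remote FS
    private final Map<String, String> resources = new LinkedHashMap<>();

    ResourceMapSketch(String stagingDir) {
        this.stagingDir = stagingDir;
    }

    private static String baseName(String path) {
        int slash = path.lastIndexOf('/');
        return slash >= 0 ? path.substring(slash + 1) : path;
    }

    /** SHIP('/foo/bar'): map "bar" to its location in the job's staging directory. */
    void ship(String localPath) {
        String name = baseName(localPath);
        // A real implementation would copy localPath to the staging directory here.
        resources.put(name, stagingDir + "/" + name);
    }

    /** CACHE('/foo/bar#fragment'): map "fragment" to /foo/bar, assumed to exist remotely. */
    void cache(String spec) {
        int hash = spec.indexOf('#');
        String remotePath = hash >= 0 ? spec.substring(0, hash) : spec;
        // Assumption: without a fragment, use the file name as the resource name.
        String name = hash >= 0 ? spec.substring(hash + 1) : baseName(remotePath);
        resources.put(name, remotePath);
    }

    Map<String, String> resources() {
        return resources;
    }
}
```

With a staging directory of, say, hdfs://namenode/staging/job_1, calling ship("/foo/bar") followed by cache("/foo/bar#fragment") would leave the map holding "bar" -> hdfs://namenode/staging/job_1/bar and "fragment" -> /foo/bar.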

In this patch, I changed to the second approach.
> On Dec. 26, 2013, 10:08 a.m., Rohini Palaniswamy wrote:

I removed the YARN URLs and now use java.net.URL only to identify files to ship from the local filesystem to the remote FS.
- Alex
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/16309/#review30868
On Dec. 23, 2013, 5:34 p.m., Alex Bain wrote: