Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
If you are using Oozie, dropping all the JARs your MR jobs need into the
Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs
are in the distributed cache.
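
If the ClassNotFoundException still shows up after that, it may help to check from inside the task whether the class is visible at all, and which JAR it came from. Below is a minimal sketch using only the JDK; ClassProbe and locate are names made up for illustration, not part of Hadoop or Oozie, and the class name passed in is whatever your job actually needs.

```java
import java.security.CodeSource;

// Hypothetical helper (not part of Hadoop or Oozie): reports where a class
// would be loaded from, or null if the given loader cannot see it.
final class ClassProbe {

    static String locate(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running its static initializers
            Class<?> c = Class.forName(className, false, loader);
            CodeSource src = c.getProtectionDomain().getCodeSource();
            // JDK core classes typically have no code source
            return src == null ? "bootstrap/unknown" : src.getLocation().toString();
        } catch (ClassNotFoundException | LinkageError e) {
            return null; // not visible to this loader
        }
    }

    public static void main(String[] args) {
        // Class name is taken from the command line; the default is only a placeholder.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        String where = locate(name, loader);
        System.out.println(name + " -> " + (where == null ? "NOT FOUND" : where));
    }
}
```

Calling ClassProbe.locate(...) with the thread context ClassLoader at the start of a mapper or reducer, and logging the result, would show whether the JAR reached the task or whether the class is only visible in the launching Java process.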
On Thu, May 26, 2011 at 7:45 AM, John Armstrong <[EMAIL PROTECTED]> wrote:
> Hi, everybody.
> I'm running into some difficulties getting needed libraries to map/reduce
> tasks using the distributed cache.
> I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement
> by the client, so more current versions are not really viable options.
> The code I've inherited is Java, which sets up and runs the MR job.
> There's currently some nontrivial pre- and post-processing, so it will be a
> large refactoring before I can just run bare MR jobs rather than starting
> them through Java.
> Further complicating matters: in practice the Java jobs are launched by
> Oozie, which of course does so by wrapping each one in a MR shell. The
> upshot is that I don't have any control over which "local" filesystem the
> Java job is run from, though if local files are absolutely needed I can
> make my Java wrappers copy stuff back from HDFS to the Java job's local
> filesystem.
> So here's the problem:
> mappers and/or reducers need class Needed, which is contained in
> needed-1.0.jar, which is in HDFS:
> Java program executes:
> Inspecting the Job object I find the file has been added to the cache
> files as expected:
> job.conf.overlay[...] = mapred.cache.files ->
> job.conf.properties[...] = mapred.cache.files ->
> And the class seems to show up in the internal ClassLoader:
> job.conf.classLoader.classes[...] = "class my.class.package.Needed"
> though this may just be inherited from the ClassLoader of the Java process
> itself (which also uses Needed).
> And yet as soon as I get into the mapreduce job itself I start getting:
> 2011-05-25 17:22:56,080 INFO JobClient - Task Id :
> attempt_201105251330_0037_r_000043_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> Up until this point we've run things by having a directory on each node
> containing all the libraries we'd need, and including that in the Hadoop
> classpath, but we have no such control in the deployment scenario, so we
> have to make our program hand the needed libraries to the map and reduce
> nodes via the distributed cache classpath.
> Thanks in advance for any insight or assistance you can offer.