Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John,

If you are using Oozie, dropping all the JARs your MR jobs need into the
Oozie workflow (WF) lib/ directory should suffice. Oozie will make sure all
those JARs are in the distributed cache.
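
For illustration, a workflow application laid out along those lines might
look like this on HDFS (the application path and JAR names here are
hypothetical; only the lib/ directory next to workflow.xml matters):

    hdfs://namenode/user/john/apps/my-wf/
        workflow.xml
        lib/
            needed-1.0.jar
            some-other-dependency.jar

Oozie should then ship everything under lib/ to the tasks itself, with no
DistributedCache calls needed in the driver code.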

Alejandro

On Thu, May 26, 2011 at 7:45 AM, John Armstrong <[EMAIL PROTECTED]> wrote:

> Hi, everybody.
>
> I'm running into some difficulties getting needed libraries to map/reduce
> tasks using the distributed cache.
>
> I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement
> by the client, so more current versions are not really viable options.
>
> The code I've inherited is a Java program that sets up and runs the MR job.
> There's currently some nontrivial pre- and post-processing, so it would take
> a large refactoring before I could just run bare MR jobs rather than
> starting them through Java.
>
> Further complicating matters: in practice the Java jobs are launched by
> Oozie, which of course does so by wrapping each one in an MR shell.  The
> upshot is that I don't have any control over which "local" filesystem the
> Java job is run from, though if local files are absolutely needed I can
> make my Java wrappers copy stuff back from HDFS to the Java job's local
> filesystem.
>
> So here's the problem:
>
> My mappers and/or reducers need the class Needed, which is contained in
> needed-1.0.jar, stored in HDFS:
>    hdfs://.../libdir/distributed/needed-1.0.jar
>
> Java program executes:
>    DistributedCache.addFileToClassPath(
>        new Path("hdfs://.../libdir/distributed/needed-1.0.jar"),
>        job.getConfiguration());
>
> Inspecting the Job object I find the file has been added to the cache
> files as expected:
>    job.conf.overlay[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
>    job.conf.properties[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
>
> And the class seems to show up in the internal ClassLoader:
>    job.conf.classLoader.classes[...] = "class my.class.package.Needed"
>
> though this may just be inherited from the ClassLoader of the Java process
> itself (which also uses Needed).
>
> And yet as soon as I get into the MapReduce job itself I start getting:
>
> 2011-05-25 17:22:56,080  INFO JobClient - Task Id : attempt_201105251330_0037_r_000043_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException: my.class.package.Needed
>
> Up until this point we've run things by having a directory on each node
> containing all the libraries we need and including that directory in the
> Hadoop classpath. We have no such control in the deployment scenario,
> though, so our program has to hand the needed libraries to the map and
> reduce nodes via the distributed cache classpath.
>
> Thanks in advance for any insight or assistance you can offer.
>
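
For readers who land on this thread later, here is a minimal sketch of a
driver doing what John describes against the 0.20.2 API. The class name, job
name, JAR path, and input/output arguments are placeholders, and the
mapper/reducer setup is elided; the point is only to show where the
DistributedCache call has to sit relative to the Job, not to offer a drop-in
fix for the problem above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NeededJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "uses-needed-jar");
            job.setJarByClass(NeededJobDriver.class);

            // Job copies the Configuration passed to its constructor, so the
            // classpath entry must go into the Job's own Configuration.
            DistributedCache.addFileToClassPath(
                    new Path("/libdir/distributed/needed-1.0.jar"),
                    job.getConfiguration());

            // ... setMapperClass/setReducerClass with the classes that use Needed ...

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The essential constraints are that addFileToClassPath must mutate the
Configuration instance the job will actually submit, and that the JAR must
already exist on HDFS before the job is launched.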