Re: Problems adding JARs to distributed classpath in Hadoop 0.20.2
John,

If you are using Oozie, dropping all the JARs your MR jobs needs in the
Oozie WF lib/ directory should suffice. Oozie will make sure all those JARs
are in the distributed cache.

Alejandro
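
For reference, a rough sketch of the layout Alejandro is describing (the application path and JAR names here are made up): any JAR placed under the workflow application's lib/ directory in HDFS is picked up by Oozie, shipped through the distributed cache, and made available on the classpath of the launched actions.

   hdfs://.../user/john/wf-app/
       workflow.xml
       lib/
           needed-1.0.jar
           other-dependency.jar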

On Thu, May 26, 2011 at 7:45 AM, John Armstrong <[EMAIL PROTECTED]> wrote:

> Hi, everybody.
>
> I'm running into some difficulties getting needed libraries to map/reduce
> tasks using the distributed cache.
>
> I'm using Hadoop 0.20.2, which from what I can tell is a hard requirement
> by the client, so more current versions are not really viable options.
>
> The code I've inherited is Java, which sets up and runs the MR job.
> There's currently some nontrivial pre- and post-processing, so it will be a
> large refactoring before I can just run bare MR jobs rather than starting
> them through Java.
>
> Further complicating matters: in practice the Java jobs are launched by
> Oozie, which of course does so by wrapping each one in a MR shell.  The
> upshot is that I don't have any control over which "local" filesystem the
> Java job is run from, though if local files are absolutely needed I can
> make my Java wrappers copy stuff back from HDFS to the Java job's local
> filesystem.
>
> So here's the problem:
>
> My mappers and/or reducers need the class Needed, which is contained in
> needed-1.0.jar, which is in HDFS:
>    hdfs://.../libdir/distributed/needed-1.0.jar
>
> Java program executes:
>    DistributedCache.addFileToClassPath(
>        new Path("hdfs://.../libdir/distributed/needed-1.0.jar"),
>        job.getConfiguration());
>
> Inspecting the Job object I find the file has been added to the cache
> files as expected:
>    job.conf.overlay[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
>    job.conf.properties[...] = mapred.cache.files -> hdfs://.../libdir/distributed/needed-1.0.jar
>
> And the class seems to show up in the internal ClassLoader:
>    job.conf.classLoader.classes[...] = "class my.class.package.Needed"
>
> though this may just be inherited from the ClassLoader of the Java process
> itself (which also uses Needed).
>
> And yet as soon as I get into the mapreduce job itself I start getting:
>
> 2011-05-25 17:22:56,080  INFO JobClient - Task Id :
> attempt_201105251330_0037_r_000043_0, Status : FAILED
> java.lang.RuntimeException: java.lang.ClassNotFoundException:
> my.class.package.Needed
>
> Up until this point we've run things by having a directory on each node
> containing all the libraries we need and including that directory in the
> Hadoop classpath. We have no such control in this deployment scenario,
> though, so our program has to hand the needed libraries to the map and
> reduce tasks via the distributed cache classpath.
>
> Thanks in advance for any insight or assistance you can offer.
>
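
For anyone comparing notes, here is a minimal, self-contained driver sketch of the DistributedCache.addFileToClassPath approach on the 0.20.2 mapreduce API. The class name, input/output paths, and JAR location below are hypothetical, and the JAR is assumed to already sit in HDFS; the path is given relative to the default FileSystem, which is reported to match more reliably on 0.20.x than a fully qualified hdfs:// URI. The two println calls simply show the properties addFileToClassPath is expected to populate, so you can confirm the setup before submitting.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NeededJobDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "needed-example");
            job.setJarByClass(NeededJobDriver.class);

            // Note the capitalization: addFileToClassPath. The path is relative
            // to the default FileSystem (fs.default.name), so it resolves
            // against HDFS when the driver runs with the cluster configuration.
            DistributedCache.addFileToClassPath(
                new Path("/libdir/distributed/needed-1.0.jar"),
                job.getConfiguration());

            // Sanity check: these are the properties addFileToClassPath should
            // have populated before the job is submitted.
            System.out.println("mapred.cache.files = "
                + job.getConfiguration().get("mapred.cache.files"));
            System.out.println("mapred.job.classpath.files = "
                + job.getConfiguration().get("mapred.job.classpath.files"));

            // Hypothetical input/output locations; the identity Mapper/Reducer
            // and text input/output formats are the 0.20.2 defaults.
            FileInputFormat.addInputPath(job, new Path("/user/john/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/john/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }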