I've got a tar.gz file containing a number of third-party jars that my MR job
requires. This tar.gz file is located on HDFS. When configuring my MR job, I
call DistributedCache.addArchiveToClassPath(), passing in the HDFS path to the
tar.gz file. When the Mapper executes, I get a ClassNotFoundException because
the Mapper process can't find one of the jars, even though that jar is in the
tar.gz archive I added to the classpath via the DistributedCache.
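For reference, here's roughly what my driver does (the HDFS path, job name, and class names below are placeholders, not my real values):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class JobRunner {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Archive lives on HDFS; placeholder path to the tar.gz of jars.
        // Must be called on the Configuration *before* the Job is created,
        // since Job takes a copy of the conf.
        DistributedCache.addArchiveToClassPath(
                new Path("/libs/dependencies.tar.gz"), conf);

        Job job = new Job(conf, "my-mr-job");
        job.setJarByClass(JobRunner.class);
        // ... mapper/reducer/input/output setup elided ...

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```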
I looked at the TaskTracker logs and saw entries showing that the tar.gz file
was extracted (see below), and when I look in the extraction folder, I see the
individual jar files.
Looking at the Hadoop source, the TaskDistributedCacheManager class takes the
path where the archive was unpacked and passes it to
DistributedCache.addLocalArchives(). I assume that in later processing this
path is pulled from the configuration object and added to the classpath of
the mapper process.
So on the surface everything looks correct: the tar.gz file is shipped to the
TaskTracker, unpacked, and the folder it is unpacked into is recorded in the
task's configuration object. But the Mapper still can't find the jars it
needs.
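To confirm the jars never actually make it onto the task's classpath, I added a quick check like the following to the Mapper's setup() (the jar name here is a placeholder for one of my dependencies):

```java
import java.io.File;

public class ClasspathCheck {
    // Returns true if any classpath entry mentions the given jar name.
    static boolean onClasspath(String classpath, String jarName) {
        for (String entry : classpath.split(File.pathSeparator)) {
            if (entry.contains(jarName)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Inside setup() I log this instead of printing it.
        String cp = System.getProperty("java.class.path");
        System.out.println("dependency on classpath? "
                + onClasspath(cp, "some-dependency.jar"));
    }
}
```

In my case the unpacked jars never show up in java.class.path for the child JVM, which matches the ClassNotFoundException.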
Also, I invoke my MR job programmatically with Job.waitForCompletion(), so
using the -libjars argument from the command line isn't an option here. And
I'd really rather not build jars with all their dependent jars unpacked into
them.
Any idea what I'm doing wrong when passing an archive file into the
distributed cache to be placed on the classpath?
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Creating
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Extracting
INFO org.apache.hadoop.filecache.TrackerDistributedCacheManager: Cached