Re: streaming cacheArchive shared libraries
Hi Keith,

I tried the exact use case you described and it works fine for me.
Here are the commands I ran:

[ramya]$ jar vxf samplelib.jar
 created: META-INF/
 inflated: libhdfs.so

[ramya]$ hadoop dfs -put samplelib.jar samplelib.jar

[ramya]$ hadoop jar hadoop-streaming.jar -input InputDir \
    -mapper "ls testlink/libhdfs.so" -reducer NONE -output out -cacheArchive

[ramya]$ hadoop dfs -cat out/*

Hope it helps.
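
Judging from the mapper command above, the unpacked archive contents seem to land under a directory named after the link (testlink here) rather than directly in the task's working directory. If that is the case, an executable that expects the .so as a sibling will not find it unless the link directory is on the library search path. The following is only a local sketch of that idea, not something taken from this thread: it fakes a task directory with mktemp, uses an empty placeholder for libhdfs.so, and the helper name prepend_lib_dir and the mapper name my_mapper are hypothetical.

```shell
#!/bin/sh
# Sketch: locate a library unpacked by -cacheArchive and expose it to the
# dynamic linker. "testlink" mirrors the directory name in the mapper command
# above; the rest is a hypothetical stand-in that runs without a cluster.

# Prepend a directory to LD_LIBRARY_PATH without leaving a stray colon.
prepend_lib_dir() {
    dir=$1
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        LD_LIBRARY_PATH="$dir:$LD_LIBRARY_PATH"
    else
        LD_LIBRARY_PATH="$dir"
    fi
    export LD_LIBRARY_PATH
}

# Simulate the task working directory (assumed layout, not verified on a cluster).
task_cwd=$(mktemp -d)
mkdir "$task_cwd/testlink"
: > "$task_cwd/testlink/libhdfs.so"   # empty placeholder, not a real library

cd "$task_cwd" || exit 1

# The library sits inside the link directory, not next to the executable.
if [ -e libhdfs.so ]; then found_in_cwd=yes; else found_in_cwd=no; fi
if [ -e testlink/libhdfs.so ]; then found_in_link=yes; else found_in_link=no; fi
echo "in cwd: $found_in_cwd, in testlink/: $found_in_link"

# Make the link directory visible to the dynamic linker.
prepend_lib_dir "$PWD/testlink"
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# A real wrapper script would now start the streaming executable, e.g.:
#   exec ./my_mapper
```

On a cluster, a wrapper like this would be what you pass as the -mapper, with the final exec line enabled, so the executable starts with the link directory already on its search path.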


On 8/5/11 10:10 AM, "Keith Wiley" <[EMAIL PROTECTED]> wrote:
I can use cacheFile to load .so files into the distributed cache and it
works fine (the streaming executable links against the .so and runs), but I
can't get it to work with -cacheArchive.  It always says it can't find the
.so file.  I realize that if you jar a directory, the directory will be
recreated when you unjar, but I've tried jarring a file directly.  It is
easily verified that unjarring such a file reproduces the original file as a
sibling of the jar file itself.  So it seems to me that cacheArchive should
have transferred the jar file to the cwd of my task, unjarred it, and
produced a .so file right there, but it doesn't link up with the executable.
Like I said, I know this basic approach works just fine with cacheFile.

What could be the problem here?  I can't easily see the files on the cluster
since it is a remote cluster with limited access.  I don't believe I can ssh
to any individual machine to investigate the files that are created for a
task...but I think I have worked through the process logically and I'm not
sure what I'm doing wrong.


Keith Wiley     *[EMAIL PROTECTED]*     keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda