Multiple various streaming questions
I would really appreciate any help people can offer on the following matters.

When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do.  With any of the first four parameters anywhere in the command, I always get a "Streaming Command Failed!" error; the last four work fine.  Note that some of the first four (-files, for example) do work when I run a Hadoop job in the normal framework, just not when I specify the streaming jar.

How do I specify a Java class as the reducer?  I have found examples online, but they always reference "built-in" classes.  If I try to use my own class, the job tracker produces this error: Cannot run program "org.uw.astro.coadd.Reducer2": java.io.IOException: error=2, No such file or directory.  As my first question shows, I have been looking for ways to get the .jar file containing the class into the distributed cache, but -libjars and -archives don't work, and if I upload the .jar to the cluster and use -cacheArchive, the command runs but I still get the "No such file" error.  I can use natively compiled programs for the mapper and reducer just fine, but not a Java class.  What I want is a native mapper and a Java reducer; my native mapper runs, but then the Java reducer fails as described.

How do I force each record (input file) to be processed by its own mapper, to get maximum parallelism?  All I found online was this terse description (of an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
I don't understand exactly what that means or how to go about doing it.  In the normal Hadoop framework I have achieved this goal by setting mapred.max.split.size small enough that only one input record fits (about 6 MB).  I tried the same thing with my streaming job via "-jobconf mapred.max.split.size=X", where X is a very low number, about the size of a single streaming input record (which in the streaming case is not 6 MB but merely ~100 bytes, just a filename referenced via -cacheFile).  It didn't work; multiple records were sent to each mapper anyway.  Achieving 1-to-1 parallelism between map tasks, nodes, and input records is very important because my map tasks take a very long time to run, upwards of an hour.  I cannot have them queueing up on a small number of nodes while numerous unused nodes (task slots) sit idle.
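
For concreteness, here is a minimal sketch of what the quoted two-step recipe seems to describe, written as a Python streaming mapper.  It assumes the job's input is a text file listing one HDFS path per line, that the hadoop and gzip commands are on the PATH of every task node, and that some input formats may prepend a key and a tab to each line.  The output directory and all function names are hypothetical, and it only illustrates the gzip example, not the co-add application.

```python
#!/usr/bin/env python3
"""Streaming mapper sketch: each stdin line names one HDFS file to gzip."""
import os
import shutil
import subprocess
import sys
import tempfile

# Hypothetical HDFS directory for the gzipped results (an assumption).
OUTPUT_DIR = "/user/example/gzipped"


def extract_path(line):
    """Return the last tab-separated field, in case a key precedes the path."""
    return line.rstrip("\n").split("\t")[-1].strip()


def commands_for(hdfs_path, workdir, output_dir=OUTPUT_DIR):
    """Build the commands that fetch, gzip, and re-upload one file."""
    local = os.path.join(workdir, os.path.basename(hdfs_path))
    return [
        ["hadoop", "fs", "-get", hdfs_path, local],   # HDFS -> local disk
        ["gzip", local],                              # replaces local with local.gz
        ["hadoop", "fs", "-put", local + ".gz", output_dir],  # local -> HDFS
    ]


def main(stdin=sys.stdin, run=subprocess.check_call):
    """Process every path on stdin; `run` executes one command list."""
    for line in stdin:
        path = extract_path(line)
        if not path:
            continue  # skip blank lines
        workdir = tempfile.mkdtemp()
        try:
            for cmd in commands_for(path, workdir):
                run(cmd)
        finally:
            shutil.rmtree(workdir, ignore_errors=True)
        # Emit one key/value pair so the job has some output to collect.
        print(path + "\tdone")


if __name__ == "__main__":
    main()
```

A script like this would presumably be shipped with -file and named as the -mapper.  On its own it does nothing to guarantee that each map task receives only one line of the list, which is the part I am stuck on.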

I realize I'm asking a lot of questions here, but I would greatly appreciate any assistance on these issues.


Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda