Multiple various streaming questions
Keith Wiley 2011-02-02, 07:40
I would really appreciate any help people can offer on the following matters.
When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do. With any of the first four parameters anywhere in the command, I always get a "Streaming Command Failed!" error. The last four work, though. Note that some of those parameters (-files) do work when I run a Hadoop job in the normal framework, just not when I specify the streaming jar.
How do I specify a Java class as the reducer? I have found examples online, but they always reference "built-in" classes. If I try to use my own class, the job tracker produces a "Cannot run program "org.uw.astro.coadd.Reducer2": java.io.IOException: error=2, No such file or directory" error. As you can see from my first question, I am trying to find ways to include the .jar file containing the class in the distributed cache, but -libjars and -archives don't work, and if I upload the .jar to the cluster and use -cacheArchive, the command runs but I still get the "No such file" error. I can use native compiled programs for the mapper and reducer just fine, but not a Java class. I want a native mapper and a Java reducer. My native mapper runs, but then the Java reducer fails as described.
How do I force a single record (input file) to be processed by a single mapper to get maximum parallelism? All I found online was this terse description (of an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
I don't understand exactly what that means or how to go about doing it. In the normal Hadoop framework I have achieved this goal by setting mapred.max.split.size small enough that only one input record fits (about 6 MB). I tried the same thing with my streaming job via "-jobconf mapred.max.split.size=X", where X is a very low number, about the size of a single streaming input record (which in the streaming case is not 6 MB but merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't work: multiple records were sent to each mapper anyway. Achieving 1-to-1 parallelism between map tasks, nodes, and input records is very important because my map tasks take a very long time to run, upwards of an hour. I cannot have them queueing up on a small number of nodes while there are numerous unused nodes (task slots) available to do work.
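For reference, here is my best guess at what the two quoted steps amount to, written as a Python streaming mapper. It is only a sketch of my reading of that description, not a tested solution: the output directory, the function names, and the "done" record it prints are all hypothetical, and it assumes the hadoop and gzip commands are on the PATH of every task node. Each input line is taken to be one full HDFS path.

```python
#!/usr/bin/env python3
"""Streaming mapper sketch: one HDFS path per input line; fetch, gzip, put back."""
import os
import subprocess
import sys

# Hypothetical HDFS output directory; not taken from the quoted description.
OUTPUT_DIR = "/user/example/gzipped"


def extract_path(line):
    """Return the HDFS path on one input line, or None if the line is blank.

    Only the last tab-separated field is kept, in case the input format
    prepends a key and a tab to each line.
    """
    field = line.rstrip("\n").split("\t")[-1].strip()
    return field or None


def plan_commands(hdfs_path, output_dir=OUTPUT_DIR):
    """Build the three commands for one file without running them."""
    local_name = os.path.basename(hdfs_path)
    gz_name = local_name + ".gz"
    return [
        ["hadoop", "fs", "-get", hdfs_path, local_name],            # copy to local disk
        ["gzip", local_name],                                       # produces local_name.gz
        ["hadoop", "fs", "-put", gz_name, output_dir + "/" + gz_name],
    ]


def main(stream=sys.stdin):
    for line in stream:
        path = extract_path(line)
        if path is None:
            continue
        for command in plan_commands(path):
            subprocess.check_call(command)  # raises if a step fails, failing the task
        # Emit one record per file so the job has some output to collect.
        print(path + "\tdone")


if __name__ == "__main__":
    main()
```

The job's input would then be the generated text file listing the paths, with this script shipped via -file and named as the -mapper. This does not by itself guarantee one path per map task, which is the part I am still asking about.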
I realize I'm asking a lot of questions here, but I would greatly appreciate any assistance on these issues.
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."