Multiple various streaming questions
I would really appreciate any help people can offer on the following matters.

When running a streaming job, -D, -files, -libjars, and -archives don't seem to work, but -jobconf, -file, -cacheFile, and -cacheArchive do.  With any of the first four parameters anywhere in the command I always get a "Streaming Command Failed!" error.  The last four work, though.  Note that some of those parameters (-files, for instance) do work when I run a Hadoop job in the normal framework, just not when I specify the streaming jar.
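
For concreteness, here is roughly the form of command I have been attempting; the streaming jar path, input/output paths, and script names below are placeholders for my actual setup, and I have also tried placing the generic options (-D, -files, -libjars) immediately after the jar, before the streaming-specific ones:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -D mapred.reduce.tasks=4 \
        -files my_mapper.py \
        -input /user/me/input \
        -output /user/me/output \
        -mapper my_mapper.py \
        -reducer org.apache.hadoop.mapred.lib.IdentityReducer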

How do I specify a Java class as the reducer?  I have found examples online, but they always reference "built-in" classes.  If I try to use my own class, the job tracker produces a 'Cannot run program "org.uw.astro.coadd.Reducer2": java.io.IOException: error=2, No such file or directory' error.  As you can see from my first question, I have been trying to find ways to include the .jar file containing the class in the distributed cache, but -libjars and -archives don't work, and if I upload the .jar to the cluster and use -cacheArchive, the command runs but I still get the "No such file" error.  I can use natively compiled programs for the mapper and reducer just fine, but not a Java class.  I want a native mapper and a Java reducer.  My native mapper runs, but then the Java reducer fails as described.
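
To illustrate, this is roughly how I am trying to combine a native mapper with my Java reducer; coadd.jar is a placeholder name for the jar that contains org.uw.astro.coadd.Reducer2, and the paths are placeholders as well:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -libjars coadd.jar \
        -input /user/me/input \
        -output /user/me/output \
        -mapper ./my_native_mapper \
        -reducer org.uw.astro.coadd.Reducer2 \
        -file my_native_mapper

This is the variant that fails with the "No such file or directory" error above; swapping -libjars for -cacheArchive (with the jar uploaded to HDFS first) gets the command to run but produces the same error from the reducer.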

How do I force a single record (input file) to be processed by a single mapper to get maximum parallelism?  All I found online was this terse description (of an example that gzips files, not my application):
• Generate a file containing the full HDFS path of the input files. Each map task would get one file name as input.
• Create a mapper script which, given a filename, will get the file to local disk, gzip the file and put it back in the desired output directory
I don't understand exactly what that means or how to go about doing it.  In the normal Hadoop framework I have achieved this goal by setting mapred.max.split.size small enough that only one input record fits (about 6 MB).  I tried the same thing with my streaming job via "-jobconf mapred.max.split.size=X", where X is a very low number, on the order of a single streaming input record (which in the streaming case is not 6 MB but merely ~100 bytes, just a filename referenced via -cacheFile), but it didn't work: multiple records were still sent to each mapper.  Achieving 1-to-1 parallelism between map tasks, nodes, and input records is very important because my map tasks take a very long time to run, upwards of an hour.  I cannot have them queueing up on a small number of nodes while numerous unused nodes (task slots) are available to do work.
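
For reference, my best guess at what the gzip description above means in practice is a mapper script along these lines, with the file of HDFS paths used as the job's -input so that each map task only ever sees filenames on stdin (all names below are placeholders, and in my case the real per-file work would be my native program rather than gzip):

    #!/bin/bash
    # Read one HDFS path per line from stdin, copy the file to local disk,
    # do the per-file work (gzip here as a stand-in), and push the result back.
    while read hdfs_path; do
        name=$(basename "$hdfs_path")
        hadoop fs -get "$hdfs_path" "$name"
        gzip "$name"
        hadoop fs -put "$name.gz" /user/me/output/
    done

Even with that, I still don't see how to guarantee that each map task receives exactly one line (one filename).  I wonder whether something like org.apache.hadoop.mapred.lib.NLineInputFormat, passed via -inputformat, is the intended way to get one line per map task, but I haven't been able to confirm that.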

I realize I'm asking a lot of questions here, but I would greatly appreciate any assistance on these issues.

Thanks.

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________