
Hadoop >> mail # user >> Streaming data locality

Re: Streaming data locality

On Feb 3, 2011, at 9:29 AM, David Rosenstrauch wrote:

> On 02/03/2011 12:16 PM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> If the input to a streaming job is not actual data splits but simple
>> HDFS file names which are then read by the mappers, then how can data
>> locality be achieved?
>> Likewise, is there any easier way to make those files accessible
>> other than using the -cacheFile flag?  That requires building a very
>> very long hadoop command (100s of files potentially).  I'm worried
>> about overstepping some command-line length limit...plus it would be
>> nice to do this programmatically, say with the
>> DistributedCache.addCacheFile() command, but that requires writing
>> your own driver, which I don't see how to do with streaming.
>> Thoughts?
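[One way around hand-writing hundreds of -cacheFile flags, sketched here as an editorial aside: generate the argument list in a short script. The jar path, mapper name, and HDFS paths below are hypothetical placeholders; the -cacheFile value is the HDFS path followed by '#' and the symlink name the file appears under in the task's working directory.]

```python
def streaming_args(jar, mapper, input_path, output_path, cache_files):
    """Build the argv list for a hadoop-streaming invocation."""
    args = ["hadoop", "jar", jar,
            "-input", input_path,
            "-output", output_path,
            "-mapper", mapper]
    for hdfs_path in cache_files:
        # Symlink each cached file under its base name in the task's cwd.
        link = hdfs_path.rsplit("/", 1)[-1]
        args += ["-cacheFile", "%s#%s" % (hdfs_path, link)]
    return args

# Hypothetical example with three cached files standing in for hundreds.
cache = ["hdfs:///user/kw/data/part-%05d" % i for i in range(3)]
cmd = streaming_args("hadoop-streaming.jar", "mapper.py",
                     "/user/kw/filelist.txt", "/user/kw/out", cache)
```

[A list built this way can be handed straight to a process-spawning call or written out as a wrapper script, which sidesteps typing the command by hand; very large argument lists are still subject to the OS's argv size limit.]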
> Submit the job in a Java app instead of via streaming?  Have a big loop where you repeatedly call job.addInputPath.  (Or, if you're going to have a large number of input files, use CombineFileInputFormat for efficiency.)
Well, I know how to write a typical Hadoop driver which "extends Configured implements Tool", if that's what you mean, but then how do I kick off a streaming job from that driver?  I only know how to start a "normal" Java Hadoop job that way (via JobClient.runJob(conf)).  How do I start a streaming job using that method?  As far as I know, a streaming job can only be started by launching the streaming jar from the command line.

Does my question make sense?
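[Editorial sketch of one answer: the streaming jar is itself just a program, so a small driver can assemble its arguments and exec it, rather than typing the command. The jar path and option values below are hypothetical. If you want to stay in Java, it is worth checking whether your Hadoop version's org.apache.hadoop.streaming.StreamJob class implements Tool, in which case it can be launched through ToolRunner from an ordinary driver.]

```python
import subprocess

def streaming_argv(jar, options):
    """Assemble the argv list for invoking the streaming jar."""
    return ["hadoop", "jar", jar] + list(options)

def run_streaming(jar, options):
    # Exec the streaming job client as a child process; blocks until it exits
    # and returns its exit status.
    return subprocess.call(streaming_argv(jar, options))

if __name__ == "__main__":
    # Hypothetical invocation; requires a working hadoop on the PATH.
    run_streaming("/opt/hadoop/contrib/streaming/hadoop-streaming.jar",
                  ["-input", "/user/kw/filelist.txt",
                   "-output", "/user/kw/out",
                   "-mapper", "mapper.py",
                   "-file", "mapper.py"])
```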

Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei