Hadoop >> mail # user >> Streaming data locality


Re: Streaming data locality

On Feb 3, 2011, at 9:29 AM, David Rosenstrauch wrote:

> On 02/03/2011 12:16 PM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>>
>> If the input to a streaming job is not actual data splits but simple
>> HDFS file names which are then read by the mappers, then how can data
>> locality be achieved?
>>
>> Likewise, is there any easier way to make those files accessible
>> other than using the -cacheFile flag?  That requires building a very
>> very long hadoop command (100s of files potentially).  I'm worried
>> about overstepping some command-line length limit...plus it would be
>> nice to do this programmatically, say with the
>> DistributedCache.addCacheFile() command, but that requires writing
>> your own driver, which I don't see how to do with streaming.
>>
>> Thoughts?
>
> Submit the job in a Java app instead of via streaming?  Have a big loop where you repeatedly call job.addInputPath.  (Or, if you're going to have a large number of input files, use CombineFileInputFormat for efficiency.)

Well, I know how to write a typical Hadoop driver which "extends Configured implements Tool" if that's what you mean, but then how do I kick off a streaming job from that driver?  I only know how to start a "normal" Java Hadoop job that way (via JobClient.runJob(conf);).  How do I start a streaming job using that method?  I only know how to start a streaming job by launching the streaming jar from the command line.

Does my question make sense?

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________
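
One way to sidestep the command-line length worry raised in the thread is to build the streaming argument list in code instead of in a shell command. The sketch below is only an illustration under stated assumptions: the class name StreamingArgs, the build helper, and the hdfs://namenode/... paths are hypothetical, and the "uri#linkname" form for -cacheFile follows the usual streaming convention. It does not address data locality.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: assembles streaming arguments programmatically. */
final class StreamingArgs {

    /**
     * Returns the argument array for a streaming job, with one -cacheFile
     * entry per HDFS file. Each entry uses the "uri#linkname" form, where
     * the link name is the file's base name.
     */
    static String[] build(String input, String output, String mapper,
                          List<String> hdfsFiles) {
        List<String> args = new ArrayList<>();
        args.add("-input");
        args.add(input);
        args.add("-output");
        args.add(output);
        args.add("-mapper");
        args.add(mapper);
        for (String uri : hdfsFiles) {
            // Base name of the file, used as the symlink name in the task directory.
            String name = uri.substring(uri.lastIndexOf('/') + 1);
            args.add("-cacheFile");
            args.add(uri + "#" + name);
        }
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) {
        // Hypothetical paths, for illustration only.
        String[] args = build("filenames.txt", "out", "mapper.py",
                Arrays.asList("hdfs://namenode/data/a.bin",
                              "hdfs://namenode/data/b.bin"));
        System.out.println(String.join(" ", args));
        // Assumption: with the Hadoop streaming jar on the classpath, this array
        // could be passed to ToolRunner.run(new StreamJob(), args), since
        // StreamJob is the class the streaming jar itself runs. Verify against
        // the Hadoop version in use before relying on it.
    }
}
```

Because the arguments never pass through a shell, the list can hold hundreds of files without hitting an operating-system command-line limit.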