On Feb 3, 2011, at 9:29 AM, David Rosenstrauch wrote:
> On 02/03/2011 12:16 PM, Keith Wiley wrote:
>> I've seen this asked before, but haven't seen a response yet.
>> If the input to a streaming job is not actual data splits but simply
>> HDFS file names which are then read by the mappers, then how can data
>> locality be achieved?
>> Likewise, is there any easier way to make those files accessible
>> other than using the -cacheFile flag? That requires building a very
>> long hadoop command (potentially hundreds of files). I'm worried
>> about overstepping some command-line length limit...plus it would be
>> nice to do this programmatically, say with the
>> DistributedCache.addCacheFile() method, but that requires writing
>> your own driver, which I don't see how to do with streaming.
> Submit the job in a Java app instead of via streaming? Have a big loop where you repeatedly call job.addInputPath. (Or, if you're going to have a large number of input files, use CombineFileInputFormat for efficiency.)
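A minimal sketch of that suggestion, assuming the old org.apache.hadoop.mapred API that streaming used at the time (class name, job name, and the idea of taking the file list from the program arguments are all illustrative; this would need the Hadoop jars on the classpath and a running cluster):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class MultiFileDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MultiFileDriver.class);
        conf.setJobName("many-input-files");

        // Add each input file in a loop instead of listing
        // hundreds of paths on the command line.
        for (String file : args) {
            FileInputFormat.addInputPath(conf, new Path(file));
        }

        // For very many small files, CombineFileInputFormat packs
        // several files into each split for efficiency; note it is
        // abstract, so a concrete subclass with a RecordReader is
        // needed before calling conf.setInputFormat(...).

        JobClient.runJob(conf);
    }
}
```

The loop is the key point: the driver can enumerate inputs from any source (arguments, a file listing, an HDFS directory scan) without ever building a long shell command.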
Well, I know how to write a typical Hadoop driver which "extends Configured implements Tool", if that's what you mean, but then how do I kick off a streaming job from that driver? I only know how to start a "normal" Java Hadoop job that way (via JobClient.runJob(conf);), and I only know how to start a streaming job by launching the streaming jar from the command line.
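For what it's worth, the streaming jar's entry point is itself an ordinary Java class, so a driver can invoke it directly instead of shelling out. A hedged sketch, assuming org.apache.hadoop.streaming.StreamJob (its exact interface varies across Hadoop releases -- in some it implements Tool, in older ones it exposes its own go()/main entry points -- and all the paths and script names below are made up for illustration):

```java
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

public class StreamingDriver {
    public static void main(String[] args) throws Exception {
        // Build the same arguments the command line would pass, but
        // programmatically -- e.g. appending one -cacheFile entry per
        // HDFS file without hitting any shell length limit.
        String[] streamArgs = new String[] {
            "-input",   "/user/keith/input",      // hypothetical path
            "-output",  "/user/keith/output",     // hypothetical path
            "-mapper",  "my_mapper.py",           // hypothetical script
            "-reducer", "my_reducer.py"           // hypothetical script
        };
        int rc = ToolRunner.run(new StreamJob(), streamArgs);
        System.exit(rc);
    }
}
```

Since the argument array is built in Java, the hundreds of -cacheFile entries could come from a loop over a file list rather than the command line.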
Does my question make sense?
Keith Wiley [EMAIL PROTECTED] keithwiley.com music.keithwiley.com
"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
-- Galileo Galilei