Hadoop >> mail # user >> Streaming data locality

Streaming data locality
I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not the actual data splits but a list of HDFS file names, which the mappers then open and read themselves, how can data locality be achieved?
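To make the setup concrete, here is a minimal sketch of the kind of mapper I mean. It assumes each input line is a single HDFS path and that the `hadoop` command is on the PATH of the task nodes; the helper names (`paths_from_lines`, `cat_command`) and the byte-count output are only illustrative. Note that the framework schedules this mapper near the small text file of names, not near the blocks of the files it goes on to read.

```python
#!/usr/bin/env python
"""Illustrative streaming mapper: stdin carries HDFS file names, not data."""
import subprocess
import sys


def paths_from_lines(lines):
    """Yield one HDFS path per non-empty input line (assumed format)."""
    for line in lines:
        path = line.strip()
        if path:
            yield path


def cat_command(path):
    """Build the command that streams one HDFS file to stdout."""
    return ["hadoop", "fs", "-cat", path]


def main():
    for path in paths_from_lines(sys.stdin):
        # The file is fetched from whichever datanodes hold its blocks,
        # which may or may not include the node running this mapper.
        proc = subprocess.Popen(cat_command(path), stdout=subprocess.PIPE)
        nbytes = 0
        for chunk in iter(lambda: proc.stdout.read(1 << 20), b""):
            nbytes += len(chunk)  # placeholder for the real per-file work
        proc.wait()
        # Emit a tab-separated key/value pair, as streaming expects.
        print("%s\t%d" % (path, nbytes))


if __name__ == "__main__":
    main()
```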

Also, is there an easier way to make those files accessible than the -cacheFile flag?  That requires building a very long hadoop command (potentially hundreds of files), and I'm worried about exceeding some command-line length limit.  It would also be nice to do this programmatically, say with DistributedCache.addCacheFile(), but that requires writing your own driver, and I don't see how to do that with streaming.
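For a rough sense of the command-line concern, the sketch below builds the -cacheFile arguments for a made-up list of 300 files and compares their total size with the limit the operating system reports. The namenode address, the file names, and the helper names are hypothetical, and the reported limit also has to cover the environment and the rest of the command, so treat the comparison as an estimate only.

```python
import os


def cache_file_args(uris):
    """Return the -cacheFile arguments for a list of HDFS URIs.

    Each file is given a symlink named after its last path component,
    using the uri#linkname form that -cacheFile accepts.
    """
    args = []
    for uri in uris:
        link = uri.rsplit("/", 1)[-1]
        args += ["-cacheFile", "%s#%s" % (uri, link)]
    return args


def args_length(args):
    """Approximate bytes used on the command line (one terminator per argument)."""
    return sum(len(a.encode("utf-8")) + 1 for a in args)


# Hypothetical input: 300 files under a made-up namenode address.
uris = ["hdfs://namenode:9000/data/part-%05d" % i for i in range(300)]
args = cache_file_args(uris)

# SC_ARG_MAX is the OS limit on combined argument and environment size (Unix only).
limit = os.sysconf("SC_ARG_MAX")
print("cacheFile arguments: %d bytes, OS limit: %d bytes" % (args_length(args), limit))
```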



Keith Wiley               [EMAIL PROTECTED]               www.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson