Streaming data locality
I've seen this asked before, but haven't seen a response yet.

If the input to a streaming job is not actual data splits but simply HDFS file names, which the mappers then open and read themselves, how can data locality be achieved?
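For concreteness, the mapper I have in mind does roughly this (just a sketch; the class name and paths are made up):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Streaming mapper whose input "records" are HDFS paths, not data.
public class PathMapper {
  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    BufferedReader stdin =
        new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = stdin.readLine()) != null) {
      // This task was scheduled near the split holding the path
      // *string*, not near the blocks of the file the path names,
      // so this open() may read entirely from remote datanodes.
      FSDataInputStream in = fs.open(new Path(line.trim()));
      try {
        // ... process the file's contents ...
      } finally {
        in.close();
      }
    }
  }
}

Since the framework never sees the real data when it computes the splits, I don't see what information it could use to place the tasks near that data.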

Likewise, is there an easier way to make those files accessible to the mappers than the -cacheFile flag? That approach requires building a very long hadoop command (potentially hundreds of files), and I'm worried about overstepping some command-line length limit. It would also be nice to do this programmatically, say with DistributedCache.addCacheFile(), but that requires writing your own driver, which I don't see how to do with streaming.
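What I'm imagining is a driver along these lines (a guess, not tested: the paths are made up, and I'm assuming StreamJob implements Tool so it can be handed off to via ToolRunner):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.streaming.StreamJob;
import org.apache.hadoop.util.ToolRunner;

// Hypothetical driver: register many cache files programmatically,
// then launch the streaming job, instead of passing hundreds of
// -cacheFile flags on the command line.
public class StreamingDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Cache every file under one directory (made-up path).
    for (FileStatus stat : fs.listStatus(new Path("/user/me/sidefiles"))) {
      DistributedCache.addCacheFile(stat.getPath().toUri(), conf);
    }

    String[] streamArgs = {
        "-input",  "/user/me/filelist.txt",
        "-output", "/user/me/out",
        "-mapper", "mapper.py"
    };
    System.exit(ToolRunner.run(conf, new StreamJob(), streamArgs));
  }
}

If that sort of hand-off to StreamJob actually works, it would sidestep the command-line length problem entirely.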

Thoughts?

Thanks.

________________________________________________________________________________
Keith Wiley               [EMAIL PROTECTED]               www.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
  -- Abe (Grandpa) Simpson
________________________________________________________________________________