Re: Streaming data locality

On Feb 3, 2011, at 6:25 PM, Allen Wittenauer wrote:

>
> On Feb 3, 2011, at 9:16 AM, Keith Wiley wrote:
>
>> I've seen this asked before, but haven't seen a response yet.
>>
>> If the input to a streaming job is not actual data splits but simply HDFS file names which are then read by the mappers, then how can data locality be achieved?
>
> If I understand your question, the method of processing doesn't matter.  The JobTracker places tasks based on input locality.  So if you are providing the names of the files you want as input via -input, then the JT will use the locations of those blocks.

Let's see here.  My streaming job has a single -input flag which points to a text file containing HDFS paths.  Each line contains one TAB.  Are you saying that if the key (or the value) on either side of that TAB is an HDFS file path then that record will be assigned to a task in a data local manner?  Which is it that determines this locality, the key or the value?  (Must be the key, right?)
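For concreteness, a line of that file looks roughly like the following (the paths and the trailing field are just illustrative placeholders, shown here with the HDFS path on the key side of the TAB):

    /user/kwiley/data/file_0001.dat <TAB> per-file-argument
    /user/kwiley/data/file_0002.dat <TAB> per-file-argument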

> (Remember: streaming.jar is basically a big wrapper around the Java methods and the parameters you pass to it are essentially the same as you'd provide to a "real" Java app.)
>
> Or are you saying your -input is a list of other files to read?  In that case, there is no locality.  But again, streaming or otherwise makes no real difference.

Yes, basically.  The input is a list of HDFS file paths to be read and processed on an individual basis.
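In rough terms, the job is launched like this (the streaming jar path, directories, and mapper script name are placeholders, not my actual setup):

    # -input is the list file itself; each map task then opens the HDFS
    # paths named in the records it receives.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input  /user/kwiley/file_list.txt \
        -output /user/kwiley/output \
        -mapper process_one_file.py \
        -file   process_one_file.py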

>> Likewise, is there any easier way to make those files accessible other than using the -cacheFile flag?  
>> That requires building a very, very long hadoop command (100s of files potentially).  I'm worried about overstepping some command-line length limit... plus it would be nice to do this programmatically, say with the DistributedCache.addCacheFile() method, but that requires writing your own driver, which I don't see how to do with streaming.
>>
>> Thoughts?
>
> I think you need to give a more concrete example of what you are doing.  -cache is used for sending files with your job and should have no bearing on what your input is to your job.  Something tells me that you've cooked something up that is overly complex. :D

Good point; I'll write a better description of this later.  Thanks for the advice.
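In the meantime, for concreteness, the kind of command I was hoping to avoid looks roughly like this (host, port, and paths are made up; the issue is repeating -cacheFile hundreds of times):

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input  /user/kwiley/file_list.txt \
        -output /user/kwiley/output \
        -mapper process_one_file.py \
        -file   process_one_file.py \
        -cacheFile hdfs://namenode:9000/user/kwiley/data/file_0001.dat#file_0001.dat \
        -cacheFile hdfs://namenode:9000/user/kwiley/data/file_0002.dat#file_0002.dat
    # ...and so on, one -cacheFile per file, potentially for hundreds of files.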

________________________________________________________________________________
Keith Wiley     [EMAIL PROTECTED]     keithwiley.com    music.keithwiley.com

"I used to be with it, but then they changed what it was.  Now, what I'm with
isn't it, and what's it seems weird and scary to me."
                                           --  Abe (Grandpa) Simpson
________________________________________________________________________________