Re: multiple file input
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
> >Each line from FileA gets compared with every line from FileB1,
> >FileB2, etc. FileB1, FileB2, etc. are in a different input directory.
> In the general case, I'd define an InputFormat that takes two  
> directories, computes the input splits for each directory and  
> generates a new list of InputSplits that is the cross-product of the  
> two lists. So instead of FileSplit, it would use a FileSplitPair that  
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
> reader would return a TextPair with left and right records (i.e.,
> lines). Clearly, you read the first line of split1 and cross it by  
> each line from split2, then move to the second line of split1 and  
> process each line from split2, etc.
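
To make the cross-product idea above concrete, here is a minimal sketch in
plain Java. It does not use the Hadoop API: SplitPair, crossSplits and
crossRecords are hypothetical names standing in for the FileSplitPair, the
InputFormat's split computation and the record reader described above, and
splits are modeled as simple strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are not Hadoop classes.
class CrossProductSketch {

    /** Pairs one split from dir1 with one split from dir2 (the "FileSplitPair" idea). */
    static final class SplitPair {
        final String left;
        final String right;

        SplitPair(String left, String right) {
            this.left = left;
            this.right = right;
        }
    }

    /** Builds the cross-product of the two split lists: one pair per (dir1, dir2) combination. */
    static List<SplitPair> crossSplits(List<String> dir1Splits, List<String> dir2Splits) {
        List<SplitPair> pairs = new ArrayList<>();
        for (String left : dir1Splits) {
            for (String right : dir2Splits) {
                pairs.add(new SplitPair(left, right));
            }
        }
        return pairs;
    }

    /**
     * Emits (left, right) line pairs in the order described above: the first
     * left line against every right line, then the second left line, and so on.
     * Assumption: both sides fit in memory here; a real record reader would
     * more likely re-read the right split for each left line.
     */
    static List<String[]> crossRecords(List<String> leftLines, List<String> rightLines) {
        List<String[]> records = new ArrayList<>();
        for (String left : leftLines) {
            for (String right : rightLines) {
                records.add(new String[] {left, right});
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<SplitPair> pairs = crossSplits(List.of("dir1/part-0", "dir1/part-1"),
                                            List.of("dir2/part-0", "dir2/part-1"));
        for (SplitPair p : pairs) {
            System.out.println(p.left + " x " + p.right);
        }
    }
}
```

Note that the number of pairs is the product of the two split counts, so the
number of map tasks grows quickly as either directory grows.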

Out of curiosity, how does Hadoop schedule tasks when a task needs
multiple inputs and the data for a task is on different nodes?  How does
it decide which node will be more "local" and should have the task
steered to it?