Re: multiple file input
On Thu, Jun 18, 2009 at 01:36:14PM -0700, Owen O'Malley wrote:
> On Jun 18, 2009, at 10:56 AM, pmg wrote:
>
> >Each line from FileA gets compared with every line from FileB1,
> >FileB2, etc. FileB1, FileB2, etc. are in a different input directory
>
> In the general case, I'd define an InputFormat that takes two  
> directories, computes the input splits for each directory and  
> generates a new list of InputSplits that is the cross-product of the  
> two lists. So instead of FileSplit, it would use a FileSplitPair that  
> gives the FileSplit for dir1 and the FileSplit for dir2 and the record  
> reader would return a TextPair with left and right records (i.e.,
> lines). Clearly, you read the first line of split1 and cross it by  
> each line from split2, then move to the second line of split1 and  
> process each line from split2, etc.
>
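
For reference, here is a minimal, library-free sketch of the scheme Owen describes, written in plain Python rather than against the real Hadoop API. FileSplitPair and TextPair are the names from his message; the FileSplit tuple, cross_product_splits and read_pairs are hypothetical stand-ins invented for illustration, and the sketch assumes the lines of split2 can be re-read for every line of split1.

```python
from itertools import product
from typing import Iterator, List, NamedTuple, Sequence, Tuple


class FileSplit(NamedTuple):
    """Hypothetical stand-in for a file split: a path plus a byte range."""
    path: str
    start: int
    length: int


class FileSplitPair(NamedTuple):
    """One split from dir1 paired with one split from dir2."""
    left: FileSplit
    right: FileSplit


# A TextPair is modelled here as a plain (left_line, right_line) tuple.
TextPair = Tuple[str, str]


def cross_product_splits(
    splits1: Sequence[FileSplit], splits2: Sequence[FileSplit]
) -> List[FileSplitPair]:
    """Pair every split of dir1 with every split of dir2."""
    return [FileSplitPair(a, b) for a, b in product(splits1, splits2)]


def read_pairs(
    left_lines: Sequence[str], right_lines: Sequence[str]
) -> Iterator[TextPair]:
    """Cross each line of split1 with every line of split2, in order.

    Assumption: right_lines is re-iterable. A real record reader would
    have to reopen or rewind split2 for each line of split1.
    """
    for left in left_lines:
        for right in right_lines:
            yield (left, right)
```

With m splits in dir1 and n splits in dir2 this produces m * n split pairs, so the number of tasks grows with the product of the two directories' split counts.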

Out of curiosity, how does Hadoop schedule tasks when a task needs
multiple inputs and the data for a task is on different nodes?  How does
it decide which node will be more "local" and should have the task
steered to it?

-Erik