Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> question about org.apache.hadoop.mapred.join


Copy link to this message
-
Re: question about org.apache.hadoop.mapred.join
Your understanding is correct. The framework doesn't do anything to
align input splits across datasets. In the situation you describe-
where one can't seek among key groups in the input data- it often
makes sense to disable splitting of the individual files by setting
the min split size to Integer.MAX_VALUE.

The description probably shouldn't use "partitioned", since that
implies that the partitioner is sufficient. -C

On Tue, Apr 10, 2012 at 8:11 AM, Koert Kuipers <[EMAIL PROTECTED]> wrote:
> I read about CompositeInputFormat and how it allows one to join two datasets
> together as long as those datasets were sorted and partitioned the same way.
> Ok i think i get it, but something bothers me. It is suggested that two
> datasets are "sorted and partitioned the same way" if they were both outputs
> from the mapreduce process with the same number of reducers with the same
> sorting & partitioning. However, something like CompositeInputFormat depends
> on the splits lining up, and two datasets going through the same reducer
> setup doesn't guarantee that at all. Splits after all are based on stuff
> like data size in MBs, and the reducers do not control that this will be the
> same. part-00007 for dataset 1 could be a different size (and have different
> number of splits) than part-00007 for dataset 2, even if they have the same
> keys and are sorted the same way. So now CompositeInputFormat would not
> work. Is this correct?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB