Re: Partitioned Datasets Map/Reduce
One possibility: write all the partition numbers (one per line) to a
single file, then use NLineInputFormat so that each line becomes its own map
task. Each mapper then receives a single line, "0" or "1" or "2" etc., as its
input value (the key is that line's byte offset in the file). In your mapper,
explicitly open /dataset1/part-(n) and /dataset2/part-(n).
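
A minimal sketch of the bookkeeping this approach needs, written in plain Java
so it runs without Hadoop on the classpath. The class and method names are
hypothetical, and the five-digit zero-padded "part-NNNNN" file name is an
assumption; adjust it to whatever your first job actually wrote. Inside a real
mapper you would pass the input value to pathsFor() and open each returned
path with FileSystem.open().

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helpers for the one-partition-number-per-line approach. */
final class PartitionPairs {

    private PartitionPairs() {}

    /** Builds the index file contents: partition numbers 0..n-1, one per line. */
    static String indexFileContents(int numPartitions) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < numPartitions; i++) {
            sb.append(i).append('\n');
        }
        return sb.toString();
    }

    /**
     * Turns one line of the index file into the pair of files to open.
     * ASSUMPTION: partition files are named part-NNNNN (five digits, zero-padded).
     */
    static List<String> pathsFor(String line, String dir1, String dir2) {
        int n = Integer.parseInt(line.trim());
        String name = String.format("part-%05d", n);
        List<String> paths = new ArrayList<>();
        paths.add(dir1 + "/" + name);
        paths.add(dir2 + "/" + name);
        return paths;
    }
}
```

For example, PartitionPairs.pathsFor("2", "/dataset1", "/dataset2") yields
/dataset1/part-00002 and /dataset2/part-00002.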

If you wanted to be more clever, it might be possible to subclass
MultiFileInputFormat to group together both datasets "file-number-wise" when
generating splits, but I don't have specific guidance here.
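
As a rough illustration only, the "file-number-wise" grouping could look like
the sketch below. It is plain Java, not a MultiFileInputFormat subclass: it
shows just the pairing logic that a custom getSplits() would need, and leaves
out all Hadoop wiring. The class name, the method name, and the assumption
that file names end in "part-" followed by digits are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: groups files from two datasets by partition number. */
final class FileNumberGrouping {

    // ASSUMPTION: each file name ends in "part-" followed by digits.
    private static final Pattern PART = Pattern.compile("part-(\\d+)$");

    private FileNumberGrouping() {}

    /** Returns partition number -> files with that number, dataset1 files first. */
    static Map<Integer, List<String>> group(List<String> dataset1, List<String> dataset2) {
        Map<Integer, List<String>> groups = new TreeMap<>();
        for (List<String> dataset : Arrays.asList(dataset1, dataset2)) {
            for (String path : dataset) {
                Matcher m = PART.matcher(path);
                if (m.find()) {
                    int n = Integer.parseInt(m.group(1));
                    groups.computeIfAbsent(n, k -> new ArrayList<>()).add(path);
                }
                // Files that do not match the pattern are skipped.
            }
        }
        return groups;
    }
}
```

Each map entry would then become one split containing the matching file from
each dataset.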

- Aaron

On Sat, Jul 3, 2010 at 9:35 AM, abc xyz <[EMAIL PROTECTED]> wrote:

>
>
> Hello everyone,
>
>
> I have written a custom partitioner for partitioning datasets. I want to
> partition two datasets using the same partitioner, and then in the next
> MapReduce job I want each mapper to handle the same partition from the two
> sources and perform some function such as a join. How can I ensure that
> one mapper gets the splits that correspond to the same partition from both
> sources?
>
>
> Any help would be highly appreciated.
>
>
>
>