Re: Best approach for accessing secondary map task outputs from reduce tasks?
If these assumptions are correct:

0) Each map outputs one result, a few hundred bytes
1) The map output is deterministic, given an input split index
2) Every reducer must see the result from every map

Then just output the result N times, where N is the number of
reducers, using a custom Partitioner that assigns each copy to
partition (records_seen++ % N), where records_seen is an int field on
the partitioner.
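
For concreteness, here is a minimal sketch of that partitioner against
the new (org.apache.hadoop.mapreduce) API; the class names and the
Text/BytesWritable key and value types are placeholders, not anything
from your job:

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Partitioner;

// Round-robin partitioner: each map emits its single summary record
// N times (N = number of reducers), so bumping a counter per record
// sends exactly one copy to every partition. This relies on the
// summary copies being the map's only output, per assumption (0).
public class BroadcastPartitioner extends Partitioner<Text, BytesWritable> {
  private int recordsSeen = 0;

  @Override
  public int getPartition(Text key, BytesWritable value, int numPartitions) {
    return recordsSeen++ % numPartitions;
  }
}

// Map side: the summary is only ready at end-of-map, so emit it from
// cleanup(), once per reducer; the partitioner above fans the copies
// out. Wire it in with job.setPartitionerClass(BroadcastPartitioner.class).
class SummaryMapper extends Mapper<LongWritable, Text, Text, BytesWritable> {
  private final Text summaryKey = new Text();
  private final BytesWritable summaryValue = new BytesWritable();

  // map() would accumulate whatever the summary needs; omitted here.

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    for (int i = 0; i < context.getNumReduceTasks(); i++) {
      context.write(summaryKey, summaryValue);
    }
  }
}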

If (1) does not hold, then write the first stage as a job with a single
(optional) reduce, and the second stage as a map-only job processing
the result.
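
A rough driver for that two-stage fallback, against the 0.20-era API;
the paths and job names below are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TwoStageDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path stage1Out = new Path("/tmp/stage1-out");  // made-up intermediate dir

    // Stage 1: run the non-deterministic maps exactly once; the single
    // (optional) reduce collects all their results in one place.
    Job stage1 = new Job(conf, "stage-1");
    stage1.setNumReduceTasks(1);
    // mapper/reducer and key/value classes omitted here
    FileInputFormat.addInputPath(stage1, new Path(args[0]));
    FileOutputFormat.setOutputPath(stage1, stage1Out);
    if (!stage1.waitForCompletion(true)) System.exit(1);

    // Stage 2: map-only (zero reduces), processing stage 1's output.
    Job stage2 = new Job(conf, "stage-2");
    stage2.setNumReduceTasks(0);
    // mapper and key/value classes omitted here
    FileInputFormat.addInputPath(stage2, stage1Out);
    FileOutputFormat.setOutputPath(stage2, new Path(args[1]));
    System.exit(stage2.waitForCompletion(true) ? 0 : 1);
  }
}

-C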

On Sun, Feb 13, 2011 at 12:18 PM, Jacques <[EMAIL PROTECTED]> wrote:
> I'm outputting a small amount of secondary summary information from a map
> task that I want to use in the reduce phase of the job.  This information is
> keyed on a custom input split index.
>
> Each map task outputs this summary information (less than a hundred bytes
> per input split).  Note that the summary information isn't ready until the
> completion of the map task.
>
> Each reduce task needs to read this information (for all input splits) to
> complete its task.
>
> What is the best way to pass this information to the Reduce stage?  I'm
> working in Java, using cdhb2.  Ideas I had include:
>
> 1. Output this data to MapContext.getWorkOutputPath().  However, that data
> is not available anywhere in the reduce stage.
> 2. Output this data to "mapred.output.dir".  The problem here is that the
> map task writes immediately to this, so failed task attempts and speculative
> execution could cause collision issues.
> 3. Output this data as in (1) and then use Mapper.cleanup() to copy these
> files to "mapred.output.dir".  Could work, but I'm still a little concerned
> about collision/race issues, as I'm not clear on when a map task attempt
> becomes "the" committed attempt for that split.
> 4. Use an external system to hold this information and then just call that
> system from both phases.  This is basically an alternative to #3 and has the
> same issues.
>
> Are there suggested approaches of how to do this?
>
> It seems like (1) might make the most sense if there is a defined way to
> stream secondary outputs from all the mappers within the Reduce.setup()
> method (see the sketch after this message).
>
> Thanks for any ideas.
>
> Jacques
>
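
On that last point, a minimal sketch of reading the summaries back in
Reducer.setup(), assuming the committed per-map summary files have
already been copied into a single HDFS directory whose name is passed
through a hypothetical "summary.dir" configuration key:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SummaryAwareReducer
    extends Reducer<Text, BytesWritable, Text, Text> {

  @Override
  protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    // Hypothetical key naming the directory the committed map-side
    // summary files were copied into.
    Path summaryDir = new Path(conf.get("summary.dir"));
    for (FileStatus stat : fs.listStatus(summaryDir)) {
      FSDataInputStream in = fs.open(stat.getPath());
      try {
        // parse one per-split summary record; the format is job-specific
      } finally {
        in.close();
      }
    }
  }

  // reduce() omitted; it would consult the summaries loaded above.
}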