-Re: Best approach for accessing secondary map task outputs from reduce tasks?
If these assumptions are correct:
0) Each map outputs one result, a few hundred bytes
1) The map output is deterministic, given an input split index
2) Every reducer must see the result from every map
Then just output the result N times, where N is the number of
reducers, using a custom Partitioner that assigns the result to
(records_seen++ % N), where records_seen is an int field on the
If (1) does not hold, then write the first stage as job with a single
(optional) reduce, and the second stage as a map-only job processing
the result. -C
On Sun, Feb 13, 2011 at 12:18 PM, Jacques <[EMAIL PROTECTED]> wrote:
> I'm outputting a small amount of secondary summary information from a map
> task that I want to use in the reduce phase of the job. This information is
> keyed on a custom input split index.
> Each map task outputs this summary information (less than hundred bytes per
> input task). Note that the summary information isn't ready until the
> completion of the map task.
> Each reduce task needs to read this information (for all input splits) to
> complete its task.
> What is the best way to pass this information to the Reduce stage? I'm
> working on java using cdhb2. Ideas I had include:
> 1. Output this data to MapContext.getWorkOutputPath(). However, that data
> is not available anywhere in the reduce stage.
> 2. Output this data to "mapred.output.dir". The problem here is that the
> map task writes immediately to this so failed jobs and speculative execution
> could cause collision issues.
> 3. Output this data as in (1) and then use Mapper.cleanup() to copy these
> files to "mapred.output.dir". Could work but I'm still a little concerned
> about collision/race issues as I'm not clear about when a Map task becomes
> "the" committed map task for that split.
> 4. Use an external system to hold this information and then just call that
> system from both phases. This is basically an alternative of #3 and has the
> same issues.
> Are there suggested approaches of how to do this?
> It seems like (1) might make the most sense if there is a defined way to
> stream secondary outputs from all the mappers within the Reduce.setup()
> Thanks for any ideas.