MapReduce, mail # user - Best approach for accessing secondary map task outputs from reduce tasks?


Re: Best approach for accessing secondary map task outputs from reduce tasks?
Chris Douglas 2011-02-14, 13:43
On Sun, Feb 13, 2011 at 8:22 PM, Jason <[EMAIL PROTECTED]> wrote:
> I think this kind of partitioner is a little hackish. A more straightforward approach is to emit the extra data N times under special keys and write a partitioner that recognizes these keys and dispatches them accordingly among partitions 0..N-1.
> Also, if this data needs to be shipped to the reducers up front, that could easily be done using a custom sort comparator.

As listed in the assumptions, I thought each map emits only one datum
that must be read by every reduce, not one special datum among its
normal output. Changing the output record type to add the partition
struck me as overly formal, so the hackish solution seemed
appropriate. If the summary data complement the record data emitted
from the map, then composing the job as you describe is standard.
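
For readers following the thread, the composition Jason describes might
look roughly like the sketch below. It is plain Java with no Hadoop
dependency, so it only mirrors the logic one would place in a
Partitioner subclass (overriding getPartition) and in a comparator
registered with Job.setSortComparatorClass. The key prefix
"__summary__#", the class name, and the helper names are hypothetical
and chosen for illustration only; String keys are assumed.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SummaryPartitionerSketch {
    // Hypothetical marker for summary records; not part of Hadoop.
    static final String SUMMARY_PREFIX = "__summary__#";

    // Special key that targets exactly one reduce partition.
    static String summaryKey(int partition) {
        return SUMMARY_PREFIX + partition;
    }

    // Keys a map would emit for its summary datum: one copy per reduce.
    static List<String> summaryKeys(int numPartitions) {
        List<String> keys = new ArrayList<>();
        for (int p = 0; p < numPartitions; p++) {
            keys.add(summaryKey(p));
        }
        return keys;
    }

    // Same role as a partitioner's getPartition: summary keys go to the
    // partition encoded in the key, all other keys are hash-partitioned.
    static int getPartition(String key, int numPartitions) {
        if (key.startsWith(SUMMARY_PREFIX)) {
            int target = Integer.parseInt(key.substring(SUMMARY_PREFIX.length()));
            return target % numPartitions;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    // Sort order that places summary keys ahead of ordinary keys, so each
    // reduce sees the summary data before any regular record.
    static int compareKeys(String a, String b) {
        boolean aSummary = a.startsWith(SUMMARY_PREFIX);
        boolean bSummary = b.startsWith(SUMMARY_PREFIX);
        if (aSummary != bSummary) {
            return aSummary ? -1 : 1;
        }
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        int numReduces = 4;
        List<String> emitted = new ArrayList<>(summaryKeys(numReduces));
        emitted.add("apple");
        emitted.add("banana");
        Collections.sort(emitted, SummaryPartitionerSketch::compareKeys);
        for (String key : emitted) {
            System.out.println(key + " -> partition " + getPartition(key, numReduces));
        }
    }
}
```

The cost of this layout is that every map writes its summary N times,
once per reduce, in exchange for keeping everything inside the ordinary
shuffle.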

However, if the map is non-deterministic, then (again) all of the
output from the first stage, not just the summary data, must go to
durable storage (i.e. HDFS), or re-executions will yield inconsistent
results. I haven't set up MO to effect the shuffle in HDFS as Harsh
describes, but it could be made to work. -C