Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Performance of direct vs indirect shuffling

Copy link to this message
Re: Performance of direct vs indirect shuffling
On Tue, Dec 20, 2011 at 4:53 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:

> The advantages of the "pull" based shuffle is fault tolerance - if you
> shuffle to the reducer and then the reducer dies, you have to rerun
> *all* of the earlier maps in the "push" model.

you would have the same situation if you aren't replicating the blocks in
the mapper.

in my situation I'm replicating the shuffle data so it should be a zero sum

The map jobs are just re-run where the last one failed since the shuffle
data has already been written.

(I should note that I'm working on another Map Reduce implementation that
I'm about to OSS)...

There are a LOT of problems in the map reduce space which are themselves
research papers and it would be nice to see more published in this area.
> The advantage of writing to disk is of course that you can have more
> intermediate output than fits in RAM.
well if you're shuffling across the network and you back up due to network
IO then your map jobs would just run slower.
> In practice, for short jobs, the output might stay entirely in buffer
> cache and never actually hit disk (RHEL by default configures the
> writeback period to 30 seconds when there isn't page cache pressure).
Or just start to block when memory is exhausted.