Hadoop >> mail # user >> Broad question on sorting of mapper outputs.

Broad question on sorting of mapper outputs.
Is there any documentation on the internals of the shuffle and sort phase?
The elephant book seems to be the best source, but it appears to only
lightly touch upon the "magic" part (i.e. the distributed merge sorting and
mapper spilling).

Also, what is the rationale behind sorting mapper outputs? Is the reason
to optimize the streaming of mapper values to reducers? In simple
scenarios, i.e. when there is no reducing to be done, it seems that we may
not care whether mapper outputs are sorted: a random merge of all spilled
records would be sufficient.
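
To make the question concrete, here is a rough sketch in plain Java (not Hadoop code) of what I assume the merge buys the reducer: if each spill is already sorted by key, a k-way merge brings equal keys together, so the values for a key can be consumed as a stream without buffering the whole input. The class name SpillMergeSketch, the Cursor helper and the in-memory "spills" are all made up for illustration; the real spill and merge code surely differs.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical illustration only: models spills as in-memory, key-sorted lists.
public class SpillMergeSketch {

    // One cursor per spill: the current record plus the remainder of the run.
    private static final class Cursor {
        Map.Entry<String, Integer> head;
        final Iterator<Map.Entry<String, Integer>> rest;

        Cursor(Iterator<Map.Entry<String, Integer>> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    // K-way merge of key-sorted spills. Because equal keys arrive back to back,
    // a running sum per key is enough; no per-key hash table is needed.
    static List<String> mergeAndSum(List<List<Map.Entry<String, Integer>>> spills) {
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                (a, b) -> a.head.getKey().compareTo(b.head.getKey()));
        for (List<Map.Entry<String, Integer>> spill : spills) {
            if (!spill.isEmpty()) {
                heap.add(new Cursor(spill.iterator()));
            }
        }

        List<String> out = new ArrayList<>();
        String currentKey = null;
        int sum = 0;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            String key = c.head.getKey();
            if (currentKey != null && !key.equals(currentKey)) {
                out.add(currentKey + "=" + sum); // key changed: previous group is complete
                sum = 0;
            }
            currentKey = key;
            sum += c.head.getValue();
            if (c.rest.hasNext()) {              // advance this spill and re-queue it
                c.head = c.rest.next();
                heap.add(c);
            }
        }
        if (currentKey != null) {
            out.add(currentKey + "=" + sum);     // flush the final group
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> spillA = Arrays.asList(
                new SimpleEntry<>("a", 1), new SimpleEntry<>("c", 1));
        List<Map.Entry<String, Integer>> spillB = Arrays.asList(
                new SimpleEntry<>("a", 1), new SimpleEntry<>("b", 1), new SimpleEntry<>("c", 1));
        System.out.println(mergeAndSum(Arrays.asList(spillA, spillB)));
    }
}
```

With unsorted spills, the same grouping would need either a full sort on the reduce side or a table holding every key at once, which is the trade-off I am trying to understand.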

I've noticed that the Shuffle and Sort classes in Hadoop have almost no
comments and appear to simply wrap other classes.

Jay Vyas
anil gupta 2012-10-24, 22:01