Is there any documentation on the internals of the shuffle and sort phase?
The elephant book seems to be the best source, but it appears to only
lightly touch upon the "magic" part (i.e. the distributed merge sorting and
Also, what is the rationale behind sorting mapper outputs? Is the reason
to optimize the streaming of mapper values to reducers? In simple
scenarios, e.g. when there is no reducing to be done, it seems that we
would not need sorted mapper outputs: an arbitrary merge of all spilled
records would be sufficient.
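
To make the question concrete, here is a toy sketch of my current understanding of why sortedness might matter. It is not Hadoop's code: the function names and the in-memory "spills" are made up for illustration, and I am assuming each spill is already sorted by key. If that holds, a k-way merge can stream the records and hand a reducer all values for one key at a time without buffering the whole input.

```python
import heapq
from itertools import chain, groupby
from operator import itemgetter

def merge_sorted_spills(spills):
    """Lazily k-way merge spill runs (hypothetical helper).

    Assumes every spill is already sorted by key.
    """
    return heapq.merge(*spills, key=itemgetter(0))

def stream_groups(records):
    """Yield (key, values) for each run of consecutive equal keys."""
    for key, run in groupby(records, key=itemgetter(0)):
        yield key, [value for _, value in run]

# Three made-up spill files, each sorted by key.
spills = [
    [("apple", 1), ("cherry", 1)],
    [("apple", 1), ("banana", 1)],
    [("banana", 1), ("cherry", 1), ("cherry", 1)],
]

# Sorted merge: each key shows up as exactly one contiguous group.
result = list(stream_groups(merge_sorted_spills(spills)))

# Plain concatenation: the same key is split across several groups,
# so a reducer would have to buffer or re-group the records itself.
fragmented = list(stream_groups(chain.from_iterable(spills)))
```

With the sorted merge, `result` has one entry per distinct key, while `fragmented` contains repeated keys. When no reducer consumes the groups, the second form looks sufficient, which is the scenario I am asking about.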
I've noticed that the Shuffle and Sort classes in Hadoop have almost no
comments and appear to simply wrap other classes.