Broad question on sorting of mapper outputs.
Jay Vyas 2012-10-20, 03:19
Is there any documentation on the internals of the shuffle and sort phase?
The elephant book seems to be the best source, but it only lightly touches on the "magic" part (i.e. the distributed merge sort and mapper spilling).

Also, what is the rationale behind sorting mapper outputs? Is it to optimize the streaming of mapper values to reducers? In simple scenarios, i.e. when there is no reducing to be done, it seems that we may not care whether mapper outputs are sorted: a random merge of all spilled records would be sufficient.

I've noticed that the Shuffle and Sort classes in Hadoop have almost no comments and appear to simply wrap other classes.

--
Jay Vyas
http://jayunit100.blogspot.com