Just a question about the implementation of Map/Reduce.
I've been thinking about the output of the map stage.
Logically all of the records emitted by the mapper have to be partitioned and
sorted before they go into the reducers. (We can ignore the partitioning for the
moment and so I'm just interested in the sorting.)
Now it seems to me that the obvious way to do this would be to have
some sort of sorted structure (balanced binary tree for example) so that
the (K, V) pairs emitted would be held in sorted order.
But when I read White's TDG (3ed p209) it's pretty explicit that the data
emitted is just held in a circular buffer in memory and that the sort occurs
in-memory as part of the dump to disc. Examining the code
(version 1.0.4) MapTask.java proves this to be the case.
So the first question is why is it done this way?
As far as I can see buffering the objects and doing one sort at the
end is going to be computational complexity of order NlgN
but maintaining a sorted in-memory structure is going to be of
the same order so, at least asymptotically, there isn't going to be
much difference between the two approaches.
I can see that by serializing the objects into the circular buffer
immediately you can minimize the number of key and value objects
that need to be instantiated. (Particularly if the mapper re-uses its
objects and custom comparators are used then barely any objects
need to be constructed or destroyed.) In this way I guess that the
load on the heap can be kept minimal.
Is this the reason that it is done this way?
Is there some other reason, perhaps a property of Java (I'm no
expert) which makes one way preferable to the other?
It's just that if the emitted data were held sorted then the opportunity
to run the "combiner" much earlier seems to be much easier.
For example if were running wordCount (or anything that needs
the SumReducer) then we could just increment the counts
held in-memory and we'd never have to emit duplicates.
(In fact I'm tempted to hold a HashSet of references to the
(K,V) pairs held serialized in the buffer and then incrementing
the counts in memory so that I don't need to emit duplicates.
But this seems perverse and far too implementation dependent.)