MapFileOutputFormat: keys out of order when emitting in reduce (Hadoop 0.20)
Saptarshi Guha 2009-12-23, 20:46
I re-wrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question.
Suppose my Map sends key-value pairs to the reducers.
In my reducer, for a given key, I emit several pairs: key1,value1, key2,value2, ...
E.g. the key (sent to reduce) is e780f987932c84d41e4f14d7607fcb69c6889
(stored as a BytesWritable variation) and the value is several lines.
In the reduce, I emit
key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value= subset of values
key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value= subset of values
and so on.
(The keys and values are stored in binary form, and the comparator is a binary comparator.)
So the reduce will be emitting keys in a not necessarily sorted order, and
MapFileOutputFormat throws the following exception:
Reduce:java.io.IOException: key out of order:
after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
(out of order using binary comparator)
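
To illustrate how a byte-wise comparator can see these keys as out of order, here is a minimal, self-contained sketch in plain Java (no Hadoop classes). It rests on assumptions that are not from my actual job: that the composite key is the hash text followed by the counter written as decimal text, and that the writer rejects any key that compares lower than the previous one. The names KeyOrderSketch, compareBytes, compositeKey, emitted and firstOutOfOrder are hypothetical, and the real check inside MapFile.Writer may differ in detail.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class KeyOrderSketch {

    // Lexicographic comparison of unsigned bytes; if one array is a prefix
    // of the other, the shorter one sorts first.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Hypothetical composite key layout (an assumption): hash text followed
    // by the counter, either as plain decimal text or zero-padded to 5 digits.
    static byte[] compositeKey(String hash, int counter, boolean padded) {
        String s = padded ? String.format("%s%05d", hash, counter) : hash + counter;
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Keys in the order a reducer would emit them: counter = 1, 2, ..., n.
    static List<byte[]> emitted(String hash, int n, boolean padded) {
        List<byte[]> keys = new ArrayList<>();
        for (int c = 1; c <= n; c++) {
            keys.add(compositeKey(hash, c, padded));
        }
        return keys;
    }

    // Index of the first key that compares lower than its predecessor,
    // or -1 if the whole sequence is in non-decreasing order.
    static int firstOutOfOrder(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (compareBytes(keys.get(i - 1), keys.get(i)) > 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String hash = "e780f987932c84d41e4f14d7607fcb69c6889";
        System.out.println("plain counters:  " + firstOutOfOrder(emitted(hash, 10, false)));
        System.out.println("padded counters: " + firstOutOfOrder(emitted(hash, 10, true)));
    }
}
```

Under these assumptions the plain-text counter "10" compares lower than "9" byte by byte, so the tenth emitted key is flagged, while a fixed-width counter keeps the emitted sequence sorted. Whether that matches my real key encoding is exactly what I am unsure about.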
I know the reduce receives keys in sorted order, but the keys it emits may not
be, so I'm not totally surprised.
Q1: Is this expected with MapFileOutputFormat?
Q2: Is the workaround to emit as SequenceFileOutputFormat, then run an identity
map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force
the user to use double the space (at least)?