Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - MapFileoutput Format: keys out of order when emitting in reduce (Hadoop 0.20)


Copy link to this message
-
MapFileoutput Format: keys out of order when emitting in reduce (Hadoop 0.20)
Saptarshi Guha 2009-12-23, 20:46
Hello,

I re-wrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question.
Suppose my Map sends key-value pairs to the reducers.
In my reducer, for a given key value, i emit key1,value1, key2,value2, ... ,
keyn,valuen

e.g the key (sent to reduce) is e780f987932c84d41e4f14d7607fcb69c6889
(stored as bytes writable
variation) and value is several lines

In the reduce, i emit

key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value= subset of values
key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value= subset of values and so
on

(The key, values stored in a binary form, the comparator is a binary
comparator).

So the reduce will be emitting keys in a not necessarily sorted order and
MapOutputFormat throws the following exception:

Reduce:java.io.IOException: key out of order:
 "e780f987932c84d41e4f14d7607fcb69c6889" "1"
 after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192)
        at org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79)
(out of order using binary comparator)

I know the reduce receives keys in sorted order, but the keys it emits may not
be, so I'm not totally surprised.

Q1: Is this expected with MapFileOutputFormat?
Q2: Is the work around to emit as SequenceFileOutputFormat, then run an identity
map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force
the user to use double the space(at least)?
Much thanks
Saptarshi