Hadoop >> mail # user

MapFileOutputFormat: keys out of order when emitting in reduce (Hadoop 0.20)
Hello,

I rewrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question.
Suppose my map sends key-value pairs to the reducers.
In my reducer, for a given input key, I emit (key1, value1), (key2, value2), ...,
(keyn, valuen).

For example, the key sent to reduce is e780f987932c84d41e4f14d7607fcb69c6889
(stored as a BytesWritable variant) and the value is several lines.
In the reduce, I emit

key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value=subset of values
key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value=subset of values

and so on.

(The keys and values are stored in binary form, and the comparator is a binary
comparator.)

So the reduce may emit keys in an order that is not sorted, and
MapFileOutputFormat throws the following exception:

Reduce: java.io.IOException: key out of order:
 "e780f987932c84d41e4f14d7607fcb69c6889" "1"
 after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
        at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192)
        at org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79)
(The keys are out of order under the binary comparator.)

I know the reduce receives keys in sorted order, but the keys it emits may not
be, so I'm not totally surprised.
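
To make the failure concrete, here is a minimal sketch of the kind of ordering check that the stack trace points at: each appended key is compared with the previous one and rejected if it sorts lower. This is not Hadoop's code. OrderedWriterSketch and compareBytes are hypothetical names, and the unsigned byte-by-byte comparison is only an assumption about what "binary comparator" means here.

```java
import java.io.IOException;
import java.util.Arrays;

/** Simplified stand-in for an append-only writer that insists on sorted keys. */
final class OrderedWriterSketch {
    private byte[] lastKey; // last key accepted, or null if nothing was written yet
    private int count;      // number of keys accepted so far

    /** Lexicographic comparison of raw bytes, treating each byte as unsigned (assumed ordering). */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    /** Accepts the key only if it does not sort before the previously appended key. */
    void append(byte[] key) throws IOException {
        if (lastKey != null && compareBytes(lastKey, key) > 0) {
            throw new IOException("key out of order: " + Arrays.toString(key)
                    + " after " + Arrays.toString(lastKey));
        }
        lastKey = key.clone();
        count++;
    }

    int size() {
        return count;
    }
}
```

With a check like this, whatever the reducer emits has to be non-decreasing across all reduce calls in the task, not just within one call.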

Q1: Is this expected with MapFileOutputFormat?
Q2: Is the workaround to emit as SequenceFileOutputFormat, then run an identity
map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force
the user to use at least double the space?
Much thanks
Saptarshi
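
The two-pass idea in Q2 can be sketched in plain Java: the first pass leaves keys in whatever order the reducer emitted them, and the second pass produces a sorted copy that an order-checking writer would accept. This is only an in-memory illustration under assumptions. In a real job the sort would be done by the shuffle of the second MapReduce pass, and SortBeforeAppendSketch, RAW_BYTES, sortedCopy and isNonDecreasing are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory illustration of "write unsorted first, then re-sort in a second pass". */
final class SortBeforeAppendSketch {
    /** Unsigned lexicographic byte order (assumed stand-in for the binary comparator). */
    static final Comparator<byte[]> RAW_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    };

    /** Second pass: returns a sorted copy and leaves the first-pass output untouched. */
    static List<byte[]> sortedCopy(List<byte[]> firstPassKeys) {
        List<byte[]> out = new ArrayList<>(firstPassKeys);
        out.sort(RAW_BYTES);
        return out;
    }

    /** True if every key is greater than or equal to the one before it. */
    static boolean isNonDecreasing(List<byte[]> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (RAW_BYTES.compare(keys.get(i - 1), keys.get(i)) > 0) {
                return false;
            }
        }
        return true;
    }
}
```

Note that sortedCopy keeps both the unsorted input and the sorted output alive at once, which mirrors the space concern raised in Q2: until the intermediate output is deleted, both copies exist.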