-
MapFileoutput Format: keys out of order when emitting in reduce (Hadoop 0.20)
Saptarshi Guha 2009-12-23, 20:46
Hello,
I rewrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question. Suppose my map sends key-value pairs to the reducers. In my reducer, for a given key, I emit key1,value1, key2,value2, ..., keyn,valuen.
For example, the key sent to reduce is e780f987932c84d41e4f14d7607fcb69c6889 (stored as a BytesWritable variation) and the value is several lines.
In the reduce, I emit
key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value=subset of values
key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value=subset of values
and so on.
(The keys and values are stored in binary form; the comparator is a binary comparator.)
So the reduce emits keys in a not necessarily sorted order, and MapFileOutputFormat throws the following exception:
Reduce:java.io.IOException: key out of order: "e780f987932c84d41e4f14d7607fcb69c6889" "1" after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
    at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206)
    at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192)
    at org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79)
(out of order using binary comparator)
I know the reduce receives keys in sorted order, but the keys it emits may not be, so I'm not totally surprised.
Q1: Is this expected with MapFileOutputFormat?
Q2: Is the workaround to emit as SequenceFileOutputFormat, then run an identity map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force the user to use at least double the space?
Much thanks,
Saptarshi
-
Re: MapFileoutput Format: keys out of order when emitting in reduce (Hadoop 0.20)
Todd Lipcon 2009-12-23, 23:33
On Wed, Dec 23, 2009 at 12:46 PM, Saptarshi Guha <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I re-wrote MapFileOutputFormat for use with Hadoop 0.20.1 and have a question.
> Suppose my Map sends key-value pairs to the reducers.
> In my reducer, for a given key value, i emit key1,value1, key2,value2, ... , keyn,valuen
>
> e.g the key (sent to reduce) is e780f987932c84d41e4f14d7607fcb69c6889
> (stored as bytes writable variation) and value is several lines
>
> In the reduce, i emit
>
> key=(e780f987932c84d41e4f14d7607fcb69c6889, 1), value= subset of values
> key=(e780f987932c84d41e4f14d7607fcb69c6889, 2), value= subset of values
> and so on
>
> (The key, values stored in a binary form, the comparator is a binary comparator).
>
> So the reduce will be emitting keys in a not necessarily sorted order and
> MapOutputFormat throws the following exception:
>
> Reduce:java.io.IOException: key out of order:
> "e780f987932c84d41e4f14d7607fcb69c6889" "1"
> after "e72e96c506c4e5cefbc2889e124228f67d121" "10"
> at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:206)
> at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:192)
> at org.godhuli.rhipe.RHMapFileOutputFormat$1.write(RHMapFileOutputFormat.java:79)
> (out of order using binary comparator)
>
> I know the reduce receives keys in sorted order, but the keys it emits may not
> be, so I'm not totally surprised.
>
> Q1: Is this expected with MapFileOutputFormat?
>
Yes. MapFiles are stored in sorted order so that lookup can be done with binary search.
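To make the sorted-order requirement concrete, here is a minimal sketch in plain Java of why an append-time order check and binary-search lookup go together. It is a simplified in-memory model, not Hadoop's MapFile code: the class `SortedKeyFile` and its methods are hypothetical names, and the keys in the usage note are assumed to be plain UTF-8 text, which the thread does not state.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a sorted key/value file (not Hadoop's MapFile).
class SortedKeyFile {
    private final List<byte[]> keys = new ArrayList<>();
    private final List<String> values = new ArrayList<>();

    // Unsigned, byte-by-byte lexicographic comparison: the kind of ordering
    // a "binary comparator" imposes on serialized keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // Reject any key that sorts before the previously appended key.
    void append(byte[] key, String value) throws IOException {
        if (!keys.isEmpty() && compareBytes(key, keys.get(keys.size() - 1)) < 0) {
            throw new IOException("key out of order");
        }
        keys.add(key);
        values.add(value);
    }

    // Binary search over the keys; correct only because append() kept them sorted.
    String get(byte[] key) {
        int lo = 0;
        int hi = keys.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int c = compareBytes(keys.get(mid), key);
            if (c == 0) {
                return values.get(mid);
            }
            if (c < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return null; // key not present
    }

    // Helper for the usage note: encode a text key as bytes (an assumption).
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }
}
```

Under this model, appending `utf8("k-10")` and then `utf8("k-2")` is accepted, because byte-wise order puts "k-10" before "k-2". Appending "k-1" afterwards is rejected, since it sorts before the last key written. Whatever the reducer emits therefore has to arrive already sorted under the same comparator the writer uses.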
> Q2: Is the work around to emit as SequenceFileOutputFormat, then run an identity
> map (with a reduce) and output as MapFileOutputFormat? If so, doesn't this force
> the user to use double the space(at least)?
>
Yes, but you can remove the intermediate output when you're done. On most clusters, the size of data that you can run jobs on in a reasonably short timeframe is fairly small compared to the total capacity. That is to say, on a 10 node cluster, each with 4x1TB disks, it will take many hours to sort even 5-10TB.
-Todd