|
|
-
Limit number of records or total size in combiner input using jobconf?
Saptarshi Guha 2009-02-13, 14:41
Hello, Running a MR job on 7 machines failed when it came to processing 53GB. Browsing the errors, org.saptarshiguha.rhipe.GRMapreduce$GRCombiner.reduce(GRMapreduce.java:149) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.combineAndSpill(MapTask.java:1106) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:979) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:391) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:876) The reason why my line failed is that there were too many records. I offload calculations to a another program and it screamed out of memory. Seeing the source in sortAndSpill where this happened:(hadoop -0.19) int spstart = spindex; while (spindex < endPosition && kvindices[kvoffsets[spindex % kvoffsets.length] + PARTITION] == i) { ++spindex; } // Note: we would like to avoid the combiner if we've fewer // than some threshold of records for a partition if (spstart != spindex) { combineCollector.setWriter(writer); RawKeyValueIterator kvIter = new MRResultIterator(spstart, spindex); combineAndSpill(kvIter, combineInputCounter); } So here are my questions: (1) is there a jobconf hint to limit the number of records in kviter? I can (and have) made a fix to my code that processes the values in a combiner step in batches (i.e takes N at a go,processes that and repeat), but was wondering if i could just set an option.
Since this occurred in the MapContext, changing the number of reducers wont help. (2) How does changing the number of reducers help at all? I have 7 machines, so I feel 11 (a prime close to 7, why a prime?) is good enough (some machines are 16GB others 32GB)
Regards Saptarshi -- Saptarshi Guha - [EMAIL PROTECTED]
-
Re: Limit number of records or total size in combiner input using jobconf?
Chris Douglas 2009-02-20, 22:34
> So here are my questions: > (1) is there a jobconf hint to limit the number of records in kviter? > I can (and have) made a fix to my code that processes the values in a > combiner step in batches (i.e takes N at a go,processes that and > repeat), but was wondering if i could just set an option.
Approximately and indirectly, yes. You can limit the amount of memory allocated to storing serialized records in memory (io.sort.mb) and the percentage of that space reserved for storing record metadata (io.sort.record.percent, IIRC). That can be used to limit the number of records in each spill, though you may also need to disable the combiner during the merge, where you may run into the same problem.
You're almost certainly better off designing your combiner to scale well (as you have), since you'll hit this in the reduce, too.
> Since this occurred in the MapContext, changing the number of reducers > wont help. > (2) How does changing the number of reducers help at all? I have 7 > machines, so I feel 11 (a prime close to 7, why a prime?) is good > enough (some machines are 16GB others 32GB)
Your combiner will look at all the records for a partition and only those records in a partition. If your partitioner distributes your records evenly in a particular spill, then increasing the total number of partitions will decrease the number of records your combiner considers in each call. For most partitioners, whether the number of reducers is prime should be irrelevant. -C
-
Re: Limit number of records or total size in combiner input using jobconf?
Saptarshi Guha 2009-02-24, 02:36
Thank you. On Fri, Feb 20, 2009 at 5:34 PM, Chris Douglas <[EMAIL PROTECTED]> wrote: >> So here are my questions: >> (1) is there a jobconf hint to limit the number of records in kviter? >> I can (and have) made a fix to my code that processes the values in a >> combiner step in batches (i.e takes N at a go,processes that and >> repeat), but was wondering if i could just set an option. > > Approximately and indirectly, yes. You can limit the amount of memory > allocated to storing serialized records in memory (io.sort.mb) and the > percentage of that space reserved for storing record metadata > (io.sort.record.percent, IIRC). That can be used to limit the number of > records in each spill, though you may also need to disable the combiner > during the merge, where you may run into the same problem. > > You're almost certainly better off designing your combiner to scale well (as > you have), since you'll hit this in the reduce, too. > >> Since this occurred in the MapContext, changing the number of reducers >> wont help. >> (2) How does changing the number of reducers help at all? I have 7 >> machines, so I feel 11 (a prime close to 7, why a prime?) is good >> enough (some machines are 16GB others 32GB) > > Your combiner will look at all the records for a partition and only those > records in a partition. If your partitioner distributes your records evenly > in a particular spill, then increasing the total number of partitions will > decrease the number of records your combiner considers in each call. For > most partitioners, whether the number of reducers is prime should be > irrelevant. -C >
|
|