Thanks very much for your helpful response!
I should go into some more details about this job. It's essentially a
use of the Hadoop framework to sort a large amount of data. The mapper
transforms a record to (sorting_key, record), where the sorting keys
are effectively unique, and the reducer is trivial, outputting the
record and discarding the sorting key, so the memory consumption of
both the map and the reduce steps is intended to be O(1).
However, due to the nature of the sorting, it's necessary that certain
sets of records appear together in the sorted output. Thus the
partitioner (HashPartitioner with a specially designed hash function)
will sometimes be forced to send a large number of records to a
particular reducer. This is not desirable, and it occurs only rarely,
but it's not feasible to prevent it from happening on a deterministic
basis. You could say that it creates a reliability engineering
My understanding of the configuration options you've linked to is that
they're intended for performance tuning, and that even if the defaults
are not optimal for a particular input, the shuffle should still
succeed, albeit more slowly than it could have otherwise. In
particular, it seems like the ShuffleRamManager class (I think
ReduceTask.ReduceCopier.ShuffleRamManager) is intended to prevent my
crash from occurring, by disallowing the in-memory shuffle from using
up all the JVM heap.
Is it possible that the continued existence of this OutOfMemoryError
represents a bug in ShuffleRamManager, or in some other code that is
intended to prevent this situation from occurring?
Thanks so much for your time.
On Wed, Feb 20, 2013 at 5:41 PM, Hemanth Yamijala
<[EMAIL PROTECTED]> wrote:
> There are a few tweaks In configuration that may help. Can you please look
> Also, since you have mentioned reducers are unbalanced, could you use a
> custom partitioner to balance out the outputs. Or just increase the number
> of reducers so the load is spread out.
> On Wednesday, February 20, 2013, Shivaram Lingamneni wrote:
>> I'm experiencing the following crash during reduce tasks:
>> on Hadoop 1.0.3 (specifically I'm using Amazon's EMR, AMI version
>> 2.2.1). The crash is triggered by especially unbalanced reducer
>> inputs, i.e., when one reducer receives too many records. (The reduce
>> task gets retried three times, but since the data is the same every
>> time, it crashes each time in the same place and the job fails.)
>> From the following links:
>> it seems as though Hadoop is supposed to prevent this from happening
>> by intelligently managing the amount of memory that is provided to the
>> shuffle. However, I don't know how ironclad this guarantee is.
>> Can anyone advise me on how robust I can expect Hadoop to be to this
>> issue, in the face of highly unbalanced reducer inputs? Thanks very
>> much for your time.