

RE: Shuffle In Memory OutOfMemoryError

   Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and re-ran the same job scenarios.  With mapred.reduce.parallel.copies set to 1, we continue to get the same Java heap space error.

    

-----Original Message-----
From: Ted Yu [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 09, 2010 12:56 PM
To: [EMAIL PROTECTED]
Subject: Re: Shuffle In Memory OutOfMemoryError

This issue has been resolved in
http://issues.apache.org/jira/browse/MAPREDUCE-1182

Please apply the patch
M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code
> at ReduceTask.java:1535.  I missed the fact that it's in an IOException catch
> block.  My mistake.  That's what I get for being in a rush.
>
>  For what it's worth I did re-run the job with
> mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
>  All failed with the same error:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
>   So from that it does seem like something else might be going on, yes?  I
> need to do some more research.
>
>  I appreciate your insights.
>
>  Andy
>
> -----Original Message-----
> From: Ted Yu [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, March 07, 2010 3:38 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> My observation is based on this call chain:
> MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> ramManager.canFitInMemory(decompressedLength)
>
> Basically, ramManager.canFitInMemory() makes its decision without considering
> the number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of the total
> heap may be used in shuffling if the default parameters are used.
> Of course, you should check the value of mapred.reduce.parallel.copies to
> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>
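The worst case described above can be worked through numerically. The sketch below is not Hadoop code. It assumes a shuffle buffer of 0.70 of the heap (the "0.7" in the message, believed to be the default of mapred.job.shuffle.input.buffer.percent) and a cap of 0.25 of that buffer for a single in-memory map output (inferred from "1.25" being 5 copiers times 0.25). The class, constant, and method names are hypothetical.

```java
public class ShuffleHeapEstimate {
    // Assumed values; verify against your own cluster configuration.
    static final double SHUFFLE_BUFFER_FRACTION = 0.70; // share of heap given to the shuffle buffer
    static final double MAX_SEGMENT_FRACTION = 0.25;    // assumed cap for one in-memory map output

    /** Worst-case share of the heap held if every copier holds a maximum-size segment. */
    static double worstCaseHeapFraction(int parallelCopies) {
        return parallelCopies * MAX_SEGMENT_FRACTION * SHUFFLE_BUFFER_FRACTION;
    }

    public static void main(String[] args) {
        for (int copies = 1; copies <= 5; copies++) {
            System.out.printf("parallel.copies=%d -> %.3f of heap%n",
                    copies, worstCaseHeapFraction(copies));
        }
    }
}
```

Under these assumptions, 5 copiers can hold 0.875 of the heap, which exceeds the 0.70 budget, while 4 or fewer stay within it. That matches the remark that the reasoning applies only when mapred.reduce.parallel.copies is 5.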
> About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
> has 2731 lines. So I have to guess the location of the code snippet you
> provided.
> I found this around line 1535:
>        } catch (IOException ioe) {
>          LOG.info("Failed to shuffle from " +
>                   mapOutputLoc.getTaskAttemptId(), ioe);
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
> Please confirm the line number.
>
> If we're looking at the same code, I am afraid I don't see how we can
> improve it. First, I assume IOException shouldn't happen that often.
> Second, mapOutput.discard() just sets:
>          data = null;
> for the in-memory case. Even if we call mapOutput.discard() before
> ramManager.unreserve(), we don't know when GC would kick in and make more
> memory available.
> Of course, given the large number of map outputs in your system, it is
> more likely that the root cause I described made the OOME happen sooner.
>
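The point about discard() and unreserve() can be illustrated with a toy model. This is a minimal sketch, not Hadoop's ShuffleRamManager or MapOutput: ToyRamManager and ToyMapOutput are hypothetical stand-ins. It shows only that the accounting is released immediately while the buffer itself becomes merely eligible for collection.

```java
public class DiscardVersusUnreserve {
    /** Hypothetical stand-in for the shuffle memory accounting. */
    static class ToyRamManager {
        private long reserved;
        synchronized void reserve(long bytes) { reserved += bytes; }
        synchronized void unreserve(long bytes) { reserved -= bytes; }
        synchronized long reserved() { return reserved; }
    }

    /** Hypothetical in-memory map output that owns a buffer. */
    static class ToyMapOutput {
        byte[] data;
        ToyMapOutput(int size) { data = new byte[size]; }
        // Drops the reference only; the array is reclaimed whenever the GC next runs.
        void discard() { data = null; }
    }

    public static void main(String[] args) {
        ToyRamManager ram = new ToyRamManager();
        int size = 8 * 1024 * 1024;
        ram.reserve(size);
        ToyMapOutput out = new ToyMapOutput(size);

        out.discard();       // buffer is now unreachable, but not necessarily freed
        ram.unreserve(size); // accounting drops to zero right away

        Runtime rt = Runtime.getRuntime();
        System.out.println("reserved by accounting: " + ram.reserved());
        System.out.println("heap in use (may still include the buffer): "
                + (rt.totalMemory() - rt.freeMemory()));
    }
}
```

Whichever order the two calls are made in, the accounting can report free space before the JVM has actually reclaimed the array, so a new reservation may be admitted while the old buffer still occupies heap.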
> Thanks
>
> >
> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <[EMAIL PROTECTED]
> >wrote:
>
> >
> >   Ted,
> >
> >   I'm trying to follow the logic in your mail and I'm not sure I'm