Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> Shuffle In Memory OutOfMemoryError


Copy link to this message
-
RE: Shuffle In Memory OutOfMemoryError

   Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend and re-ran the same job scenarios.  Running with mapred.reduce.parallel.copies set to 1 and continue to have the same Java heap space error.

    

-----Original Message-----
From: Ted Yu [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 09, 2010 12:56 PM
To: [EMAIL PROTECTED]
Subject: Re: Shuffle In Memory OutOfMemoryError

This issue has been resolved in
http://issues.apache.org/jira/browse/MAPREDUCE-1182

Please apply the patch
M1182-1v20.patch<http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <[EMAIL PROTECTED]>wrote:

>
>  Thanks Ted.  Very helpful.  You are correct that I misunderstood the code
> at ReduceTask.java:1535.  I missed the fact that it's in a IOException catch
> block.  My mistake.  That's what I get for being in a rush.
>
>  For what it's worth I did re-run the job with
> mapred.reduce.parallel.copies set with values from 5 all the way down to 1.
>  All failed with the same error:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>        at
> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
>
>   So from that it does seem like something else might be going on, yes?  I
> need to do some more research.
>
>  I appreciate your insights.
>
>  Andy
>
> -----Original Message-----
> From: Ted Yu [mailto:[EMAIL PROTECTED]]
> Sent: Sunday, March 07, 2010 3:38 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Shuffle In Memory OutOfMemoryError
>
> My observation is based on this call chain:
> MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling
> ramManager.canFitInMemory(decompressedLength)
>
> Basically ramManager.canFitInMemory() makes decision without considering
> the
> number of MapOutputCopiers that are running. Thus 1.25 * 0.7 of total heap
> may be used in shuffling if default parameters were used.
> Of course, you should check the value for mapred.reduce.parallel.copies to
> see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.
>
> About ramManager.unreserve() call, ReduceTask.java from hadoop 0.20.2 only
> has 2731 lines. So I have to guess the location of the code snippet you
> provided.
> I found this around line 1535:
>        } catch (IOException ioe) {
>          LOG.info("Failed to shuffle from " +
> mapOutputLoc.getTaskAttemptId(),
>                   ioe);
>
>          // Inform the ram-manager
>          ramManager.closeInMemoryFile(mapOutputLength);
>          ramManager.unreserve(mapOutputLength);
>
>          // Discard the map-output
>          try {
>            mapOutput.discard();
>          } catch (IOException ignored) {
>            LOG.info("Failed to discard map-output from " +
>                     mapOutputLoc.getTaskAttemptId(), ignored);
>          }
> Please confirm the line number.
>
> If we're looking at the same code, I am afraid I don't see how we can
> improve it. First, I assume IOException shouldn't happen that often.
> Second,
> mapOutput.discard() just sets:
>          data = null;
> for in memory case. Even if we call mapOutput.discard() before
> ramManager.unreserve(), we don't know when GC would kick in and make more
> memory available.
> Of course, given the large number of map outputs in your system, it became
> more likely that the root cause from my reasoning made OOME happen sooner.
>
> Thanks
>
> >
> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <[EMAIL PROTECTED]
> >wrote:
>
> >
> >   Ted,
> >
> >   I'm trying to follow the logic in your mail and I'm not sure I'm