Re: Stability issue - dead DN's

On May 11, 2011, at 5:57 AM, Eric Fiala wrote:
> If we do the math that means [ map.tasks.max * mapred.child.java.opts ]  +
> [ reduce.tasks.max * mapred.child.java.opts ] => or [ 4 * 2.5G ] + [ 4 *
> 2.5G ] is greater than the amount of physical RAM in the machine.
> This doesn't account for the base tasktracker and datanode process + OS
> overhead and whatever else may be hoarding resources on the systems.

+1 to what Eric said.

You've exhausted memory and now the whole system is falling apart.  
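
To put numbers on it, here is a rough sketch of Eric's arithmetic in Python. The slot counts and the 2.5G heap come from the thread; the 16 GiB of physical RAM is a made-up figure for illustration, so plug in your own. It only counts child JVM heap, so real usage (JVM overhead, tasktracker, datanode, OS) will be higher.

```python
GIB = 1024 ** 3

def worst_case_heap_bytes(map_slots, reduce_slots, child_heap_bytes):
    """Heap committed if every map and reduce slot runs a task at once."""
    return (map_slots + reduce_slots) * child_heap_bytes

# Figures from the thread: 4 map slots, 4 reduce slots, 2.5G per child JVM.
committed = worst_case_heap_bytes(4, 4, int(2.5 * GIB))

# Hypothetical value: substitute the node's real physical RAM.
physical_ram = 16 * GIB

print(f"worst-case child heap: {committed / GIB:.1f} GiB")
print(f"oversubscribed: {committed > physical_ram}")
```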

> I would play with this ratio, either less maps / reduces max - or lower your
> child.java.opts so that when you are fully subscribed you are not using
> more resource than the machine can offer.
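
One way to pick a child heap size that fits is to work backwards from the RAM you actually have. This is only a sketch: the 16 GiB node and the 4 GiB reserved for the tasktracker, datanode and OS are assumed numbers, not figures from this cluster, and `max_child_heap_bytes` is a hypothetical helper.

```python
GIB = 1024 ** 3

def max_child_heap_bytes(physical_ram, reserved, map_slots, reduce_slots):
    """Largest per-task heap that keeps a fully subscribed node within RAM."""
    usable = physical_ram - reserved
    if usable <= 0:
        raise ValueError("reserved memory exceeds physical RAM")
    return usable // (map_slots + reduce_slots)

# Assumed: 16 GiB node, 4 GiB held back for tasktracker, datanode and OS.
# The 4 map slots and 4 reduce slots are the values quoted in the thread.
heap = max_child_heap_bytes(16 * GIB, 4 * GIB, 4, 4)
print(f"max child heap: {heap / GIB:.2f} GiB")
```

Going the other way works too: keep the heap size and divide the usable RAM by it to see how many slots the node can carry.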


> Also, setting mapred.reduce.slowstart.completed.maps  to 1.00 or some other
> value close to 1 would be one way to guarantee only 4 either maps or reduces
> to be running at once and address (albeit in a duct tape like way) the
> oversubscription problem you are seeing (this represents the fractions of
> maps that should complete before initiating the reduce phase).

slowstart isn't really going to help you much here.  All it takes is another job with the same settings running at the same time and processes will start dying again.  That said, the default for slowstart (0.05) is incredibly stupid for the vast majority of workloads.  Something closer to .70 or .80 is more realistic.
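
For a feel of what the fraction means, here is a small Python sketch. It assumes the threshold is simply the fraction times the job's map count, rounded up; that rounding rule and the 200-map job are assumptions for illustration, and `maps_before_reduces` is a hypothetical helper, not a Hadoop API. Note the threshold is per job, which is why it does nothing to cap how many tasks run on a node when several jobs are active.

```python
import math

def maps_before_reduces(total_maps, slowstart_fraction):
    """Maps that must finish before a job's reduces may be scheduled."""
    if not 0.0 <= slowstart_fraction <= 1.0:
        raise ValueError("slowstart fraction must be between 0 and 1")
    return math.ceil(total_maps * slowstart_fraction)

# Compare the default with the values suggested in this thread,
# for a hypothetical job with 200 map tasks.
for fraction in (0.05, 0.70, 0.80, 1.00):
    print(fraction, maps_before_reduces(200, fraction))
```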

>> * a 2x1GE bonded network interface for interconnects
>> * a 2x1GE bonded network interface for external access

Multiple NICs on a box can sometimes cause big performance problems with Hadoop.  So watch your traffic carefully.