heartbeat and timeout question
Patai Sangbutsarakum 2013-05-22, 00:47
I am going to migrate production racks of datanodes/tasktrackers onto new
core switches. Rack awareness has been in place for a long time. I am
looking for a way to avoid re-replicating the blocks of the datanodes in the
rack that is being moved (when they are marked as dead nodes), and to avoid
shifting the tasks running on those tasktrackers to other machines.
One approach I can think of is playing with the heartbeat settings of both
datanode and tasktracker to make the timeout extra long, like 15 minutes, so
that the namenode and jobtracker are more forgiving of the nodes that are
being moved. However, the network operation needed to flip the switch should
only take around a couple of minutes per rack.
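
To make the arithmetic behind this idea concrete, here is a rough sketch in
Python. It rests on assumptions that are not stated above and should be
checked against the CDH3u4 source and defaults: that the namenode declares a
datanode dead after 2 * heartbeat.recheck.interval + 10 * dfs.heartbeat.interval
(defaults assumed to be 300000 ms and 3 s), and that the jobtracker drops a
tasktracker after mapred.tasktracker.expiry.interval (default assumed to be
600000 ms). The function names are made up for illustration.

```python
# Sketch only: formula and defaults are assumptions about the 0.20 branch,
# not values confirmed for this cluster.

def datanode_dead_timeout_ms(recheck_interval_ms=300_000, heartbeat_interval_s=3):
    """Assumed namenode rule: 2 * recheck interval + 10 * heartbeat interval."""
    return 2 * recheck_interval_ms + 10 * 1000 * heartbeat_interval_s


def survives_outage(outage_s, timeout_ms):
    """True if a node that is silent for outage_s seconds is not yet expired."""
    return outage_s * 1000 < timeout_ms


# Assumed default for mapred.tasktracker.expiry.interval, in milliseconds.
TASKTRACKER_EXPIRY_MS = 600_000

if __name__ == "__main__":
    dn_timeout = datanode_dead_timeout_ms()
    print("datanode dead timeout (min):", dn_timeout / 60_000)
    print("tasktracker expiry (min):", TASKTRACKER_EXPIRY_MS / 60_000)
    # A switch flip of about two minutes per rack, as described above.
    print("datanode survives 120 s:", survives_outage(120, dn_timeout))
    print("tasktracker survives 120 s:", survives_outage(120, TASKTRACKER_EXPIRY_MS))
    # Recheck interval that would give roughly a 15 minute datanode timeout.
    print("15 min target reached:",
          datanode_dead_timeout_ms(recheck_interval_ms=435_000) == 15 * 60_000)
```

If those assumed defaults hold, a two-minute outage per rack would already
fall inside both timeouts, so raising them to 15 minutes would only add a
safety margin.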
Possible alternatives are more than welcome.
Thanks in advance,
BTW, the cluster is on CDH3u4 (0.20 branch).