On Mon, Feb 17, 2014 at 1:59 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:

You could turn down the timeouts so that a bad disk doesn't hold up HBase
reads/writes for so long.
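For concreteness, a sketch of the client-side HDFS knobs involved (the property names are real HDFS settings; the values below are illustrative assumptions, not recommendations):

```xml
<!-- hdfs-site.xml, client side. Illustrative values only. -->
<property>
  <!-- Default is 60000 ms; a dying disk can pin a read for this long. -->
  <name>dfs.client.socket-timeout</name>
  <value>15000</value>
</property>
<property>
  <!-- Default is 480000 ms (8 minutes) for writes to a DataNode. -->
  <name>dfs.datanode.socket.write.timeout</name>
  <value>30000</value>
</property>
```

Lower values fail fast onto another replica, at the cost of more spurious failovers under ordinary GC or network hiccups, so test under load before committing to them.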

Your disks are just dying?  They are not degrading first?

That was the local DN?  You think we moved on to the other replica so
reads/writes progressed afterward?  You might want to tinker w/ some of your
timeouts to make them fail over to other replicas faster.
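Besides shortening timeouts, newer Hadoop clients (around Hadoop 2.4, HDFS-5776) can race a slow DataNode against another replica via hedged reads. A sketch, assuming a client recent enough to support it; values are illustrative:

```xml
<!-- hbase-site.xml: enable HDFS hedged reads. If the primary replica has
     not answered within the threshold, a second read is issued to another
     replica and whichever returns first wins. Illustrative values only. -->
<property>
  <name>dfs.client.hedged.read.threadpool.size</name>
  <value>20</value>
</property>
<property>
  <name>dfs.client.hedged.read.threshold.millis</name>
  <value>100</value>
</property>
```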
Were latencies going up on this node before this?  Do you monitor your
disks?  Any increase in reported errors?  Complaints in dmesg, etc., that
could have given you some forewarning?
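On the monitoring question, a minimal sketch of the kind of kernel-log check that often gives forewarning of a dying disk. The sample lines below are made up for illustration; on a real host you would pipe in `dmesg` (and pair it with `smartctl` from smartmontools) instead:

```shell
# Count kernel-log lines matching common disk-failure patterns.
# Sample input stands in for `dmesg` output here.
printf '%s\n' \
  'ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen' \
  'end_request: I/O error, dev sda, sector 123456' \
  'EXT4-fs (sda1): mounted filesystem' \
| grep -ciE 'i/o error|exception emask|medium error'
# prints 2
```

Graphing that count (or SMART reallocated/pending sector counts) per node turns a sudden DN stall into something you can see coming.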

This is because handlers got backed up, unable to write out their load to
HDFS.

60 seconds is a long time to wait on data.  Tune it down?
A thread had the row lock and was stuck on HDFS?  Any other thread that
came in would time out trying to get to the row?