HBase >> mail # dev >> mttr update


Thanks for the nice update N.

Regards
Ram

> -----Original Message-----
> From: n keywal [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, September 12, 2012 12:27 AM
> To: [EMAIL PROTECTED]
> Subject: mttr update
>
> Hi All,
>
> There is some progress on MTTR. It's detailed in HBASE-5843, but here is a
> synthesis.
> That's the server side view, including the edits split & replaying, and
> considering a timeout of 30s in ZK.
>
> 1) Region Server crash. We can expect 50s on 0.94; 20s on 0.96:
>  - the failure will be detected immediately on 0.96, after 30s on 0.94
>  - the distributed split seems to work well (i.e. it distributes well)
>  - the assignment seems to be dominated by replaying the local edits, and
> should scale well on a reasonable cluster.
>
> 2) Single Box failure (regionserver + datanode): 0.94: often around 10
> minutes. 0.96 (actually HDFS-3703): 50s.
>  - It's random. The more data to split, the more chance you have to be
> directed to the dead datanode. With little data in the memstore, it's like
> 1).
>  - The results come from HDFS-3703: we're not directed to the dead
> datanodes anymore. It's not yet in the official hdfs release.
>  - When directed to a dead datanode, HBase/HDFS retries on the same
> datanode instead of moving to another one (HBASE-6751)
>  - Distributed Split resubmits the tasks too fast (HBASE-6738)
>
> 3) Going further:
>  - 3703 simplifies a lot of things, because we've got far fewer errors from
> the underlying file system when a box dies. So in production it's gonna be
> quite useful in many cases. It would be dangerous to rely too much on it,
> i.e. being inconsistent or totally inefficient when we've got datanode
> errors. HBASE-6738 is a good example: when there is no datanode error it
> does not show up; that does not mean we don't have a problem.
>  - There are still the nasty cases, i.e. losing meta/root, or mixing a
> failure with a heavy workload (workload increases during failure) and many
> other things like this.
>  - For reliability and safety, not writing the log locally could be
> important. That's HDFS-3706.
>  - These tests are from the server point of view. There could be corner
> cases if looked at from a client point of view.
>  - And we could do things differently to serve writes and some reads
> immediately (HBASE-6752)
>  - Decreasing the detection time will become more and more important.
> (HBASE-6290, ZOOKEEPER-702, ZOOKEEPER-922, ...)
>
>  That's all folks! :-)
>
>  Nicolas
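
For reference, the numbers in case 1) can be read as detection time plus recovery time. The sketch below is only a back-of-the-envelope model, not anything taken from HBase itself: the function and constant names are made up for illustration, and the ~20s recovery figure (split + replay + assignment) is inferred from the message as 50s total minus 30s detection on 0.94.

```python
# Rough MTTR model for case 1) (region server crash, datanode still alive).
# Assumption: recovery (split + replay + assignment) takes about 20s on both
# versions; this is inferred from the 50s / 30s figures, not measured here.

ZK_SESSION_TIMEOUT_S = 30.0  # ZK timeout stated in the message
RECOVERY_S = 20.0            # inferred: 50s total - 30s detection on 0.94


def mttr_seconds(detection_s: float, recovery_s: float) -> float:
    """Total time to recover = time to detect the failure + time to recover."""
    return detection_s + recovery_s


estimates = {
    # 0.94 waits for the ZK session to expire before noticing the crash.
    "0.94": mttr_seconds(ZK_SESSION_TIMEOUT_S, RECOVERY_S),
    # 0.96 is described as detecting the failure immediately.
    "0.96": mttr_seconds(0.0, RECOVERY_S),
}

for version, total in estimates.items():
    print(f"{version}: ~{total:.0f}s")
```

Under these assumptions detection dominates on 0.94, which fits the last point in 3) about decreasing detection time. Case 2) does not fit this simple sum, since the time lost to a dead datanode is described as random.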