Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Plain View
HBase, mail # dev - hbase mttr vs. hdfs

N Keywal 2012-07-12, 21:20
Copy link to this message
Re: hbase mttr vs. hdfs
Todd Lipcon 2012-07-12, 21:24
Hey Nicolas,

Another idea that might be able to help this without adding an entire
new state to the protocol would be to just improve the HDFS client
side in a few ways:

1) change the "deadnodes" cache to be a per-DFSClient structure
instead of per-stream. So, after reading one block, we'd note that the
DN was dead, and de-prioritize it on future reads. Of course we'd need
to be able to re-try eventually since dead nodes do eventually
2) when connecting to a DN, if the connection hasn't succeeded within
1-2 seconds, start making a connection to another replica. If the
other replica succeeds first, then drop the connection to the first
(slow) node.

Wouldn't this solve the problem less invasively?


On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <[EMAIL PROTECTED]> wrote:
> Hi,
> I have looked at the HBase MTTR scenario when we lose a full box with
> its datanode and its hbase region server altogether: It means a RS
> recovery, hence reading the logs files and writing new ones (splitting
> logs).
> By default, HDFS considers a DN as dead when there is no heartbeat for
> 10:30 minutes. Until this point, the NaneNode will consider it as
> perfectly valid and it will get involved in all read & write
> operations.
> And, as we lost a RegionServer, the recovery process will take place,
> so we will read the WAL & write new log files. And with the RS, we
> lost the replica of the WAL that was with the DN of the dead box. In
> other words, 33% of the DN we need are dead. So, to read the WAL, per
> block to read and per reader, we've got one chance out of 3 to go to
> the dead DN, and to get a connect or read timeout issue. With a
> reasonnable cluster and a distributed log split, we will have a sure
> winner.
> I looked in details at the hdfs configuration parameters and their
> impacts. We have the calculated values:
> heartbeat.interval = 3s ("dfs.heartbeat.interval").
> heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
> heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10.30 minutes
> At least on 1.0.3, there is no shutdown hook to tell the NN to
> consider this DN as dead, for example on a software crash.
> So before the 10:30 minutes, the DN is considered as fully available
> by the NN.  After this delay, HDFS is likely to start replicating the
> blocks contained in the dead node to get back to the right number of
> replica. As a consequence, if we're too aggressive we will have a side
> effect here, adding workload to an already damaged cluster. According
> to Stack: "even with this 10 minutes wait, the issue was met in real
> production case in the past, and the latency increased badly". May be
> there is some tuning to do here, but going under these 10 minutes does
> not seem to be an easy path.
> For the clients, they don't fully rely on the NN feedback, and they
> keep, per stream, a dead node list. So for a single file, a given
> client will do the error once, but if there are multiple files it will
> go back to the wrong DN. The settings are:
> connect/read:  (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
> write: (5s (hardcoded) * NumberOfReplica) + 480s
> ("dfs.datanode.socket.write.timeout")
> That will set a 69s timeout to get a "connect" error with the default config.
> I also had a look at larger failure scenarios, when we're loosing a
> 20% of a cluster. The smaller the cluster is the easier it is to get
> there. With the distributed log split, we're actually on a better
> shape from an hdfs point of view: the master could have error writing
> the files, because it could bet a dead DN 3 times in a row. If the
> split is done by the RS, this issue disappears. We will however get a
> lot of errors between the nodes.
> Finally, I had a look at the lease stuff Lease: write access lock to a
> file, no other client can write to the file. But another client can
> read it. Soft lease limit: another client can preempt the lease.
> Configurable.
> Default
Todd Lipcon
Software Engineer, Cloudera
N Keywal 2012-07-12, 22:16
N Keywal 2012-07-13, 07:53
N Keywal 2012-07-13, 13:27
Andrew Purtell 2012-07-13, 16:31
Stack 2012-07-18, 09:36
lars hofhansl 2012-07-13, 17:11
Ted Yu 2012-07-13, 17:18
N Keywal 2012-07-13, 17:46
N Keywal 2012-07-16, 12:00
N Keywal 2012-07-16, 17:08
Stack 2012-07-18, 10:00
Ted Yu 2012-07-13, 17:02
Stack 2012-07-17, 15:41
N Keywal 2012-07-17, 17:14
Andrew Purtell 2012-07-17, 19:51
Stack 2012-07-18, 09:26