HBase, mail # dev - hbase mttr vs. hdfs


Re: hbase mttr vs. hdfs
Todd Lipcon 2012-07-12, 21:24
Hey Nicolas,

Another idea that might help here, without adding an entirely new
state to the protocol, would be to improve the HDFS client side in a
few ways:

1) change the "deadnodes" cache to be a per-DFSClient structure
instead of per-stream. So, after reading one block, we'd note that the
DN was dead, and de-prioritize it on future reads. Of course we'd need
to be able to re-try eventually since dead nodes do eventually
restart.
2) when connecting to a DN, if the connection hasn't succeeded within
1-2 seconds, start making a connection to another replica. If the
other replica succeeds first, then drop the connection to the first
(slow) node.

Wouldn't this solve the problem less invasively?
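
Roughly, (2) could look something like this (just a sketch with made-up
names, not the actual DFSClient code):

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;
  import java.util.List;
  import java.util.concurrent.*;

  // Race connection attempts: give the preferred replica a head start,
  // and if it hasn't connected by then, also try the next replica.
  // The first successful connection wins.
  class HedgedConnect {
    private static final ExecutorService POOL = Executors.newCachedThreadPool();

    static Socket connect(List<InetSocketAddress> replicas, long headStartMs)
        throws InterruptedException {
      CompletionService<Socket> cs = new ExecutorCompletionService<>(POOL);
      int launched = 0;
      int failed = 0;
      while (true) {
        if (launched < replicas.size()) {
          final InetSocketAddress next = replicas.get(launched++);
          cs.submit(() -> open(next));
        }
        // Wait up to headStartMs before hedging to another replica; once
        // every replica has been tried, just wait for the next result.
        Future<Socket> f = (launched < replicas.size())
            ? cs.poll(headStartMs, TimeUnit.MILLISECONDS)
            : cs.take();
        if (f == null) {
          continue;           // head start elapsed, launch the next replica
        }
        try {
          return f.get();     // first replica to connect wins
        } catch (ExecutionException e) {
          if (++failed == replicas.size()) {
            throw new RuntimeException("all replicas failed", e);
          }
        }
      }
    }

    private static Socket open(InetSocketAddress addr) throws IOException {
      Socket s = new Socket();
      s.connect(addr, 60_000);  // per-attempt cap, like dfs.socket.timeout
      return s;
    }
  }

A real version would of course also close the slower socket once a
winner is picked, and feed the failures into the per-client dead node
list from (1).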

-Todd

On Thu, Jul 12, 2012 at 2:20 PM, N Keywal <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have looked at the HBase MTTR scenario where we lose a full box with
> its datanode and its hbase region server altogether: it means a RS
> recovery, hence reading the log files and writing new ones (splitting
> logs).
>
> By default, HDFS considers a DN as dead when there has been no heartbeat
> for 10 minutes 30 seconds. Until this point, the NameNode will consider
> it as perfectly valid and it will get involved in all read & write
> operations.
>
> And, as we lost a RegionServer, the recovery process will take place,
> so we will read the WAL & write new log files. And with the RS, we
> lost the replica of the WAL that was on the DN of the dead box. In
> other words, 33% of the DNs we need are dead. So, to read the WAL, per
> block to read and per reader, we've got one chance out of 3 of going to
> the dead DN and getting a connect or read timeout. With a reasonably
> sized cluster and a distributed log split, some reader is virtually
> guaranteed to hit it.
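>
> For instance, if the split tasks have to read N blocks in total, the
> probability that at least one read goes to the dead DN first is about
> 1 - (2/3)^N, which is already ~98% for N = 10.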
>
>
> I looked in detail at the hdfs configuration parameters and their
> impact. We have the calculated values:
> heartbeat.interval = 3s ("dfs.heartbeat.interval")
> heartbeat.recheck.interval = 300s ("heartbeat.recheck.interval")
> heartbeatExpireInterval = 2 * 300 + 10 * 3 = 630s => 10 minutes 30 seconds
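>
> Spelled out (just an illustration of the formula above, with the values
> from the config, not copied from the NameNode code):
>
>   long heartbeatIntervalMs = 3 * 1000L;    // dfs.heartbeat.interval = 3s
>   long recheckIntervalMs   = 300 * 1000L;  // heartbeat.recheck.interval = 300s
>   long expireIntervalMs    = 2 * recheckIntervalMs + 10 * heartbeatIntervalMs;
>   // = 630,000 ms, i.e. 10 minutes 30 seconds before the NN marks the DN dead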
>
> At least on 1.0.3, there is no shutdown hook to tell the NN to
> consider this DN as dead, for example on a software crash.
>
> So before these 10 minutes 30 seconds, the DN is considered fully
> available by the NN.  After this delay, HDFS is likely to start
> replicating the blocks contained on the dead node to get back to the
> right number of replicas. As a consequence, if we're too aggressive we
> will have a side effect here, adding workload to an already damaged
> cluster. According to Stack: "even with this 10 minutes wait, the issue
> was met in real production cases in the past, and the latency increased
> badly". Maybe there is some tuning to do here, but going under these 10
> minutes does not seem to be an easy path.
>
> The clients don't fully rely on the NN feedback: they keep, per stream,
> a dead node list. So for a single file, a given client will hit the
> error only once, but if there are multiple files it will go back to the
> dead DN. The settings are:
>
> connect/read:  (3s (hardcoded) * NumberOfReplica) + 60s ("dfs.socket.timeout")
> write: (5s (hardcoded) * NumberOfReplica) + 480s
> ("dfs.datanode.socket.write.timeout")
>
> With the default config (3 replicas), that means a 69s timeout before
> getting a "connect" error.
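>
> Spelled out with the defaults (illustration only, not the DFSClient
> code):
>
>   int replicas = 3;
>   long connectReadMs = 3_000L * replicas + 60_000L;   // = 69,000 ms (69s)
>   long writeMs       = 5_000L * replicas + 480_000L;  // = 495,000 ms (495s)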
>
> I also had a look at larger failure scenarios, when we're losing 20%
> of a cluster. The smaller the cluster, the easier it is to get
> there. With the distributed log split, we're actually in better
> shape from an hdfs point of view: the master could have errors writing
> the files, because it could get a dead DN 3 times in a row. If the
> split is done by the RSs, this issue disappears. We will however get a
> lot of errors between the nodes.
>
> Finally, I had a look at the lease mechanism. Lease: write access lock
> on a file; no other client can write to the file, but another client
> can read it. Soft lease limit: another client can preempt the lease.
> Configurable.
> Default
Todd Lipcon
Software Engineer, Cloudera

Other replies in this thread:
N Keywal 2012-07-12, 22:16
N Keywal 2012-07-13, 07:53
N Keywal 2012-07-13, 13:27
Andrew Purtell 2012-07-13, 16:31
Stack 2012-07-18, 09:36
lars hofhansl 2012-07-13, 17:11
Ted Yu 2012-07-13, 17:18
N Keywal 2012-07-13, 17:46
N Keywal 2012-07-16, 12:00
N Keywal 2012-07-16, 17:08
Stack 2012-07-18, 10:00
Ted Yu 2012-07-13, 17:02
Stack 2012-07-17, 15:41
N Keywal 2012-07-17, 17:14
Andrew Purtell 2012-07-17, 19:51
Stack 2012-07-18, 09:26