-
one RegionServer crashed and the whole cluster was blocked
张磊 2012-10-18, 11:30
Hi, All
One of the RegionServers in our company’s cluster crashed. At that time, I found:
1. All the RegionServers stopped handling requests from the client side (requestsPerSecond=0 on the master-status UI page).
2. It took about 12-15 minutes to recover.
3. I have set hbase.regionserver.restart.on.zk.expire to true, but it does not work.
For 1, I know the cluster began to split logs and recover the data on the crashed RegionServer. Will the recovery operation block all requests from the client side?
For 2, is there any solution to reduce the recovery time?
For 3, I checked the log and found a “session is timeout” exception; maybe a full GC caused the session to time out. But why does hbase.regionserver.restart.on.zk.expire not work? My HBase version is 0.94.0.
Thanks for any suggestions and feedback!
Fowler Zhang
-
RE: one RegionServer crashed and the whole cluster was blocked
Ramkrishna.S.Vasudevan 2012-10-18, 12:15
> For 1, I knew the cluster began to split log and recover the data on the
> crashed RegionServer, will the recovery operation block all the requests
> from the client side?

Ideally it should not. But if your client was generating data for the regions that were dead at that time, then those client requests will not be served until the regions are back online on some other region server after log splitting. Client requests going to other region servers should ideally keep working. Did you look at thread dumps on the other RSs at that time? That should give some clue.

> For 2, Is there any solution to reduce the recovery time?

The recovery time depends on the amount of data, and particularly on the size of the HLog files. By default every HLog file is 256 MB. In 0.94.0 a good number of changes have gone in to make recovery faster in terms of HLog splitting.

> 3. I have set hbase.regionserver.restart.on.zk.expire to true, but it
> does not work.

I am not very sure how the code works with this property. I will check this part.
Regards Ram
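
To make the point about log size concrete, here is a rough back-of-the-envelope sketch in Python. It is only an illustration, not how HBase itself computes anything: the 256 MB file size comes from the reply above, while the function name, the number of log files, the split throughput and the number of parallel splitters are all made-up inputs that would have to be measured on a real cluster.

```python
def estimate_split_seconds(num_hlog_files,
                           hlog_size_mb=256.0,
                           split_throughput_mb_s=50.0,
                           parallel_splitters=1):
    """Rough time (seconds) to split the HLogs of one dead RegionServer.

    All parameters are illustrative assumptions:
      num_hlog_files        -- HLog files left behind by the dead server
      hlog_size_mb          -- size of each file (256 MB as quoted in the thread)
      split_throughput_mb_s -- hypothetical read+write rate of one splitter
      parallel_splitters    -- hypothetical number of servers splitting at once
    """
    if num_hlog_files < 0:
        raise ValueError("num_hlog_files must be >= 0")
    if split_throughput_mb_s <= 0 or parallel_splitters < 1:
        raise ValueError("throughput must be > 0 and splitters >= 1")
    total_mb = num_hlog_files * hlog_size_mb
    return total_mb / (split_throughput_mb_s * parallel_splitters)


if __name__ == "__main__":
    # 32 files and 50 MB/s are invented numbers, used only to show the scaling.
    for splitters in (1, 4):
        seconds = estimate_split_seconds(32, parallel_splitters=splitters)
        print(f"{splitters} splitter(s): ~{seconds:.0f} s")
```

With these invented inputs a single splitter needs roughly 164 seconds, and the estimate shrinks linearly as more splitters share the work, which is why both the amount of unflushed log data and the degree of parallelism matter.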
-
Re: one RegionServer crashed and the whole cluster was blocked
Nicolas Liochon 2012-10-18, 12:55
Hi,
Some stuff below:
On Thu, Oct 18, 2012 at 1:30 PM, 张磊 <[EMAIL PROTECTED]> wrote:
> Hi, All
>
> One of the RegionServer of our company’s cluster was crashed. At this
> time, I found:
>
> 1. All the RegionServer stopped handling the requests from the client
> side( requestsPerSecond=0 at the master-status UI page).
>
> 2. It takes about 12-15 minutes to recovery.
>
> 3. I have set hbase.regionserver.restart.on.zk.expire to true, but it
> does not work.
>
> For 1, I knew the cluster began to split log and recover the data on the
> crashed RegionServer, will the recovery operation block all the requests
> from the client side?

No. But it's worth checking that the region server that died was not the one handling the .meta. region. If that's the case, it could be an explanation (clients do have a cache, but for first-time access to a region they go to the .meta. region first).

> For 2, Is there any solution to reduce the recovery time?

12 minutes for a single region server crash (i.e. the datanode is still there, the cluster is OK) seems huge. You need to look at:
- a possible root cause: if the region server got disconnected, it may be because the network or ZooKeeper was in bad shape anyway. So the recovery is slow because the cause of the crash is still there.
- how is your cluster? Do you have a lot of regions to recover? Did you have a lot of writes on this region server?

> For 3, I checked the log, found “session is timeout” exception, maybe
> for full gc and the session was timeout. But why the
> hbase.regionserver.restart.on.zk.expire does not work? My HBase version is
> 0.94.0.

I'm not sure it's still in the code base. To be checked. As well, you can have a root cause that makes the server stop. But there are two sides to a ZK disconnect anyway:
1) the region server: if it's disconnected but actually still alive, it may decide to kill itself, or not.
2) the cluster: after the timeout, the timed-out regionserver is considered dead and the recovery starts, whatever happens in 1).
So whatever happens in 1) does not change much from an MTTR point of view, except if your cluster is small, or if you're losing multiple nodes.
There is an autorestart option in the 0.96 scripts. It changes nothing to the MTTR itself, but covers more cases of regionserver crashes. See the release notes in HBASE-5939.
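
The cluster-side view of MTTR described above can be written down as a simple sum of phases. The Python sketch below is only a way of reading that argument: the phase names, the function name and all the numbers are assumptions for illustration, not values taken from HBase or from the poster's cluster. Note that nothing in it depends on whether the dead region server restarts itself, which is the point being made.

```python
def mttr_breakdown(detection_s, log_split_s, assign_s):
    """Return each recovery phase and the total, in seconds.

    Hypothetical phases, as seen from the cluster side:
      detection_s -- time until the ZooKeeper session is declared expired
      log_split_s -- time to split the dead server's logs
      assign_s    -- time to reassign and reopen its regions
    """
    phases = {
        "detection": detection_s,
        "log_split": log_split_s,
        "assignment": assign_s,
    }
    for name, value in phases.items():
        if value < 0:
            raise ValueError(f"{name} must be >= 0, got {value}")
    # The right-hand side is evaluated before "total" is added to the dict.
    phases["total"] = sum(phases.values())
    return phases


if __name__ == "__main__":
    # All three inputs are invented; measure them on the real cluster.
    for phase, seconds in mttr_breakdown(180, 164, 60).items():
        print(f"{phase:>10}: {seconds} s")
```

With these invented inputs the phases add up to 404 seconds. Comparing such a breakdown against the observed 12-15 minutes is one way to see which phase deserves a closer look in the logs.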
Good luck,
Nicolas
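
The remark about the .meta. region and the client cache can be pictured with a toy model. The Python sketch below is not the HBase client; `ToyClient`, its fields and the region and server names are all hypothetical. It only shows why already-cached regions stay reachable while first-time lookups stall when the server hosting the catalog region is down.

```python
class ToyClient:
    """Toy model of a client that caches region locations (hypothetical)."""

    def __init__(self, meta_table, meta_online=True):
        self.meta_table = meta_table    # region name -> server name
        self.meta_online = meta_online  # is the catalog region being served?
        self.cache = {}                 # locations this client already knows

    def locate(self, region):
        # Cached locations are used without touching the catalog region.
        if region in self.cache:
            return self.cache[region]
        # A first-time lookup needs the catalog region to be available.
        if not self.meta_online:
            raise RuntimeError(f"cannot locate {region}: catalog region offline")
        server = self.meta_table[region]
        self.cache[region] = server
        return server


if __name__ == "__main__":
    client = ToyClient({"region-a": "rs1", "region-b": "rs2"})
    client.locate("region-a")       # first access goes through the catalog
    client.meta_online = False      # the server hosting the catalog region dies
    print(client.locate("region-a"))  # still answered from the cache
    try:
        client.locate("region-b")     # never looked up before
    except RuntimeError as err:
        print(err)
```

In this model a long-running client with a warm cache would barely notice the outage, while new clients or clients touching new regions would block, which matches the suggestion to check where the .meta. region was hosted.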