RE: one RegionServer crashed and the whole cluster was blocked
>   For 1, I knew the cluster began to split log and recover the data on
> the crashed RegionServer, will the recovery operation block all the
> requests from the client side?
Ideally it should not.  But if your client was sending data to regions
that were hosted on the dead server, those requests will not be served
until the regions are back online on some other region server, after log
splitting has finished.
Client requests going to the other region servers should ideally keep
working.
Did you look at the thread dumps on the other RSs at that time? They
should give some clue.
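
As a rough aid for reading such thread dumps, here is a minimal Python
sketch. It assumes the dump is plain text in the usual JVM (jstack-style)
layout, where each thread stanza starts with a quoted thread name and
contains a "java.lang.Thread.State:" line, and where stanzas are separated
by blank lines. The function names are made up for illustration and are
not part of HBase.

```python
import re
from collections import Counter

# Matches the state line the JVM prints under each thread header, e.g.
# "   java.lang.Thread.State: BLOCKED (on object monitor)".
_STATE_RE = re.compile(r"^\s*java\.lang\.Thread\.State:\s+(\w+)", re.MULTILINE)


def count_thread_states(dump_text):
    """Return a Counter mapping JVM thread state -> number of threads."""
    return Counter(_STATE_RE.findall(dump_text))


def threads_in_state(dump_text, state):
    """Return the names of all threads whose state equals `state`."""
    names = []
    # Assumption: thread stanzas are separated by at least one blank line.
    for stanza in re.split(r"\n\s*\n", dump_text):
        header = re.match(r'\s*"([^"]+)"', stanza)
        found = _STATE_RE.search(stanza)
        if header and found and found.group(1) == state:
            names.append(header.group(1))
    return names
```

If most of the request handler threads on a healthy RS show up as BLOCKED
or WAITING on the same thing, that is usually the place to start looking.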

>   For 2, Is there any solution to reduce the recovery time?
The recovery time depends on the amount of data and particularly on the size
of the HLog files.  By default every HLog file is 256MB in size.
In 0.94.0 a good number of changes have gone in to make recovery faster
in terms of HLog splitting.
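
To get a feel for how the HLog size feeds into recovery time, here is a
back-of-the-envelope Python sketch. Only the 256MB file size comes from the
reply above; the number of HLog files, the split throughput and the number
of parallel splitters are hypothetical inputs that you would have to
measure on your own cluster. This is simple arithmetic, not a model of how
HBase actually schedules log splitting.

```python
def wal_backlog_mb(num_hlogs, hlog_size_mb=256):
    """Total HLog data (in MB) left to split for one crashed RegionServer.

    hlog_size_mb defaults to the 256MB figure quoted in the reply;
    num_hlogs is whatever the crashed server had not yet archived.
    """
    if num_hlogs < 0 or hlog_size_mb < 0:
        raise ValueError("sizes and counts must be non-negative")
    return num_hlogs * hlog_size_mb


def estimated_split_seconds(num_hlogs, throughput_mb_per_s,
                            hlog_size_mb=256, parallel_splitters=1):
    """Rough split time, assuming the work divides evenly across splitters.

    throughput_mb_per_s and parallel_splitters are assumptions made for
    illustration only; real throughput depends on HDFS and the cluster.
    """
    if throughput_mb_per_s <= 0 or parallel_splitters < 1:
        raise ValueError("throughput must be > 0 and splitters >= 1")
    backlog = wal_backlog_mb(num_hlogs, hlog_size_mb)
    return backlog / (throughput_mb_per_s * parallel_splitters)
```

For example, wal_backlog_mb(32) gives 8192, i.e. 32 files of 256MB add up
to 8GB that has to be read and rewritten before the regions can reopen.
Fewer or smaller outstanding HLog files shrink that number directly.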
> 3.       I have set hbase.regionserver.restart.on.zk.expire to true,
> but it does not work.
I am not very sure how the code works with this property.  Will check this
part.

Regards
Ram

> -----Original Message-----
> From: 张磊 [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, October 18, 2012 5:01 PM
> To: [EMAIL PROTECTED]
> Subject: one RegionServer crashed and the whole cluster was blocked
>
> Hi, All
>
>   One of the RegionServers in our company’s cluster crashed. At that
> time, I found:
>
> 1.       All the RegionServers stopped handling requests from the
> client side (requestsPerSecond=0 on the master-status UI page).
>
> 2.       It took about 12-15 minutes to recover.
>
> 3.       I have set hbase.regionserver.restart.on.zk.expire to true,
> but it does not work.
>
>   For 1, I knew the cluster began to split log and recover the data on
> the crashed RegionServer, will the recovery operation block all the
> requests from the client side?
>
>   For 2, Is there any solution to reduce the recovery time?
>
>   For 3, I checked the log and found a “session is timeout” exception;
> maybe a full GC caused the session to time out. But why does
> hbase.regionserver.restart.on.zk.expire not work? My HBase version is
> 0.94.0.
>
>
>
>   Thanks for any suggestions and feedback!
>
>
>
> Fowler Zhang
>
>