Re: HBaseClient recovery from .META. server power down
By "power down" below, I mean powering down the host with the RS that
holds the .META. table. (So - essentially, the host IP is unreachable
and the RS/DN is gone.)

Just wanted to clarify the steps below ...
--S

On Mon, Jul 2, 2012 at 5:36 PM, Suraj Varma <[EMAIL PROTECTED]> wrote:
> Hello:
> We've been doing some failure scenario tests by powering down the host
> of the region server holding .META. While the HBase cluster itself
> recovers and reassigns the META region and other regions (after we
> tweaked down the default timeouts), our client apps using HBaseClient
> take a long time to recover.
>
> hbase-0.90.6 / cdh3u4 / JDK 1.6.0_23
>
> Process:
> 1) Apply load via client app on HBase cluster for several minutes
> 2) Power down the region server holding the .META. region
> 3) Measure how long it takes for the cluster to reassign the META
> table and for client threads to re-lookup and re-orient to the
> smaller cluster (minus the RS and DN on that host).
>
> What we see:
> 1) Client threads spike up to the maxThread size ... and take over 35
> mins to recover (i.e. for the thread count to go back to normal). No
> calls are being serviced; they are all just backed up on a
> synchronized method ...
>
> 2) Essentially, all the client app threads queue up behind the
> HBaseClient.setupIOStreams method in oahh.ipc.HBaseClient
> (http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hbase/hbase/0.90.2/org/apache/hadoop/hbase/ipc/HBaseClient.java#312).
> http://tinyurl.com/7js53dj
>
> After taking several thread dumps we found that the thread within this
> synchronized method was blocked on
>    NetUtils.connect(this.socket, remoteId.getAddress(), getSocketTimeout(conf));
>
> Essentially, the thread which got the lock would try to connect to the
> dead RS (till socket times out), retrying, and then the next thread
> gets in and so forth.
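>
> To illustrate the pattern, here is a simplified sketch (not the actual
> HBaseClient code; the class name, address and thread count below are
> made up):
>
>     import java.io.IOException;
>     import java.net.InetSocketAddress;
>     import java.net.Socket;
>
>     public class ConnectionSetupSketch {
>         // Unreachable RS host/port and timeout are illustrative values.
>         private final InetSocketAddress deadServer =
>                 new InetSocketAddress("10.0.0.99", 60020);
>         private final int connectTimeoutMs = 20000; // ~ipc.socket.timeout default
>
>         // Every caller funnels through this one synchronized method, so
>         // when the host is unreachable each thread in turn waits out the
>         // full connect timeout before the next one even gets the lock.
>         private synchronized void setupConnection() {
>             Socket socket = new Socket();
>             try {
>                 socket.connect(deadServer, connectTimeoutMs); // blocks up to the timeout
>             } catch (IOException e) {
>                 // connect timed out or was refused; a real client would retry
>             } finally {
>                 try { socket.close(); } catch (IOException ignored) { }
>             }
>         }
>
>         public static void main(String[] args) {
>             final ConnectionSetupSketch sketch = new ConnectionSetupSketch();
>             // With 10 threads and a 20s connect timeout, the last thread can
>             // wait roughly 10 x 20s just to find out the server is gone.
>             for (int i = 0; i < 10; i++) {
>                 new Thread(new Runnable() {
>                     public void run() { sketch.setupConnection(); }
>                 }, "client-thread-" + i).start();
>             }
>         }
>     }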
>
> Solution tested:
> -------------------
> So - the ipc.HBaseClient code shows that the ipc.socket.timeout
> default is 20s. We dropped this down to a low number (1000 ms, 100 ms,
> etc.) and the recovery was much faster (within a couple of minutes).
>
> So - we're thinking of setting the HBase client side hbase-site.xml
> with an ipc.socket.timeout of 100ms. Looking at the code, it appears
> that this is only ever used during the initial "HConnection" setup via
> NetUtils.connect, and when connectivity to a region server is lost and
> needs to be re-established; i.e. it does not affect the normal "RPC"
> activity, as this is just the connect timeout.
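>
> For reference, on the client side that setting would look something
> like this in hbase-site.xml (100 ms is the value we tested; tune it to
> your network):
>
>     <!-- Lower the IPC connect timeout so client threads stuck on a
>          dead region server fail fast instead of waiting out the 20s
>          default. Value is in milliseconds. -->
>     <property>
>       <name>ipc.socket.timeout</name>
>       <value>100</value>
>     </property>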
>
> Am I reading the code right? Any thoughts on whether this is too low
> for comfort? (Our internal tests did not show any errors related to
> timeouts etc. during normal operation ... but I just wanted to run
> this by the experts.)
>
> Note that the above timeout tweak is only on the HBase client side.
> Thanks,
> --Suraj