Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Master stuck trying to rebalance regions to a zombie region servers.


Copy link to this message
-
Master stuck trying to rebalance regions to a zombie region servers.
Pradeep Gopaluni 2012-06-12, 11:05
Hi,

We are using HBASE 0.90.3 and Hadoop 0.20.205 for our cluster. Today, we
have observed that the HBASE master was stuck because of  zombie
regionservers that was responding to ping but was stuck.

The logs show that the master tried to re-assign its regions from an
overloaded (ServerA) and one of the regions to be rebalanced is designated
for the server in zombie state (SeverB). This re-assignment started failing
with SocketTimeoutException, and the master was stuck waiting. This got
resolved only when ServerB was taken out of network. There were no master
logs printed for close to 30 mins, between the attempted reassignment, and
the time ServerB removed from network.

The issue of infinite wait seems to be very similar to the hadoop RPC issue
- https://issues.apache.org/jira/browse/HADOOP-6889 reported and fixed in
0.20.205.
Did anyone face the same problem, was this fixed in later versions of HBASE?

--
Regards
pradeep

*2012-06-07 06:59:32,693 HbaseMaster.domain.com:60000-BalancerChore INFO
org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in
1ms. Moving 26 regions off of 24 overloaded servers onto 0 less loaded
servers
.
2012-06-07 06:59:32,746 HbaseMaster.domain.com:60000-BalancerChore INFO
org.apache.hadoop.hbase.master.HMaster: balance
hri=users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf6d504ef.,
src=ServerA.domain.com <http://servera.domain.com/>,60020,1338788540876,
dest=ServerB.domain.com <http://serverb.domain.com/>,60020,1339042481685
.
.
2012-06-07 06:59:53,854 MASTER_CLOSE_REGION-HbaseMaster.domain.com:60000-3
WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
users,XePh1ThgkEgb1LJG8rWWnQ==,1327804993156.b935614a50366924dfac338bf6d504ef.
to serverName=ServerB.domain.com
<http://serverb.domain.com/>,60020,1339042481685,
load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to assign
elsewhere instead; retry=0
java.net.SocketTimeoutException: 20000 millis timeout while waiting for
channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=/
10.193.190.107:60020]
        at
org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:604)
        at
org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:328)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:883)
        at
org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:750)
        at
org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:257)
        at $Proxy7.openRegion(Unknown Source)
        at
org.apache.hadoop.hbase.master.ServerManager.sendRegionOpen(ServerManager.java:559)
        at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:934)
        at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:749)
        at
org.apache.hadoop.hbase.master.AssignmentManager.assign(AssignmentManager.java:729)
        at
org.apache.hadoop.hbase.master.handler.ClosedRegionHandler.process(ClosedRegionHandler.java:92)
        at
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:662)
2012-06-07 07:01:31,884 908082002@qtp-1543904466-10 INFO
org.apache.zookeeper.ZooKeeper: Initiating client connection, connectStringzaas3.domain.com:2181,zaas2.domain.com:2181,zaas1.domain.com:2181,
zaas5.domain.com:2181,zaas4.domain.com:2181 sessionTimeout=300000
watcher=hconnection
2012-06-07 07:01:31,884 908082002@qtp-1543904466-10-SendThread() INFO
org.apache.zookeeper.ClientCnxn: Opening socket connection to server
zaas1.domain.com/10.193.190.117:2181
2012-06-07 07:01:31,885 908082002@qtp-1543904466-10-SendThread(
zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to zaas1.domain.com/10.193.190.117:2181, initiating
session
2012-06-07 07:01:31,887 908082002@qtp-1543904466-10-SendThread(
zaas1.domain.com:2181) INFO org.apache.zookeeper.ClientCnxn: Session
establishment complete on server zaas1.domain.com/10.193.190.117:2181,
sessionid = 0x133cce032e46274, negotiated timeout = 300000
2012-06-07 07:01:33,695 908082002@qtp-1543904466-10 INFO
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation:
Closed zookeeper sessionid=0x133cce032e46274
2012-06-07 07:01:33,697 908082002@qtp-1543904466-10 INFO
org.apache.zookeeper.ZooKeeper: Session: 0x133cce032e46274 closed
2012-06-07 07:01:33,697 908082002@qtp-1543904466-10-EventThread INFO
org.apache.zookeeper.ClientCnxn: EventThread shut down
2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO
org.apache.hadoop.hbase.master.LoadBalancer: Calculated a load balance in
1ms. Moving 1 regions off of 1 overloaded servers onto 0 less loaded servers
2012-06-07 07:04:32,738 HbaseMaster.domain.com:60000-BalancerChore INFO
org.apache.hadoop.hbase.master.HMaster: balance
hri=users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51fb3006a5.,
src=mdg357.domain.com,60020,1338788541085,
dest=ServerB.domain.com<http://serverb.domain.com/>
,60020,1339042481685
2012-06-07 07:04:53,290 MASTER_CLOSE_REGION-HbaseMaster.domain.com:60000-0
WARN org.apache.hadoop.hbase.master.AssignmentManager: Failed assignment of
users,iLy9s53ke7L+6gQ9ASdzbA==,1327771137166.d75ed55fbfd7c63ab3fa3a51fb3006a5.
to serverName=ServerB.domain.com
<http://serverb.domain.com/>,60020,1339042481685,
load=(requests=0, regions=57, usedHeap=30, maxHeap=3998), trying to assign
elsewhere instead; retry=0
java.net.SocketTimeoutException: 20000 millis timeout while waiting for
channel to be ready for connect. ch :
java.nio.channels.SocketChannel[connection-pending remote=/
10.193.190.107:6002