|
|
-
balancer stopped working
Matthias Hofschen 2011-11-16, 10:40
Hi, we had a case today where the loadbalancer stopped working. (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved to another node. Shortly thereafter the regionserver of that region was stopped. In the master logs we see that master is trying to contact this regionserver to move the region. At the same time we had about 1000 regions missing because of the stopped regionserver. These where not assigned to another regionserver. I suspect that the the load balancer on the master machine was blocked by trying to move the one region which was not possible. We then restarted the master which solved the problem.
Is this a known problem, should I log an issue? I do have a stacktrace from the master machine.
Cheers Matthias
-
RE: balancer stopped working
Ramkrishna S Vasudevan 2011-11-16, 11:43
Can you share the logs to confirm things?
Regards Ram
-----Original Message----- From: Matthias Hofschen [mailto:[EMAIL PROTECTED]] Sent: Wednesday, November 16, 2011 4:10 PM To: [EMAIL PROTECTED] Subject: balancer stopped working
Hi, we had a case today where the loadbalancer stopped working. (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved to another node. Shortly thereafter the regionserver of that region was stopped. In the master logs we see that master is trying to contact this regionserver to move the region. At the same time we had about 1000 regions missing because of the stopped regionserver. These where not assigned to another regionserver. I suspect that the the load balancer on the master machine was blocked by trying to move the one region which was not possible. We then restarted the master which solved the problem.
Is this a known problem, should I log an issue? I do have a stacktrace from the master machine.
Cheers Matthias
-
Re: balancer stopped working
yuzhihong@... 2011-11-16, 11:52
Can you post the stack trace and relevant log snippets ?
Thanks
On Nov 16, 2011, at 2:40 AM, Matthias Hofschen <[EMAIL PROTECTED]> wrote:
> Hi, > we had a case today where the loadbalancer stopped working. > (cloudera-cdh3-u1, 52nodes). Basically we had a hot region that we moved to > another node. Shortly thereafter the regionserver of that region was > stopped. In the master logs we see that master is trying to contact this > regionserver to move the region. At the same time we had about 1000 regions > missing because of the stopped regionserver. These where not assigned to > another regionserver. I suspect that the the load balancer on the master > machine was blocked by trying to move the one region which was not > possible. We then restarted the master which solved the problem. > > Is this a known problem, should I log an issue? I do have a stacktrace from > the master machine. > > Cheers Matthias
-
Re: balancer stopped working
Matthias Hofschen 2011-11-16, 12:27
Hi this is the stacktrace of the master process till restart:
2011-11-16 09:28:20,007 INFO org.apache.hadoop.hbase.master.LoadBalancer: Skipping load balancing. servers=51 regions=76793 average=1505.7451 mostloaded=1506 leastloaded=1506 2011-11-16 09:29:51,276 INFO org.apache.hadoop.hbase.master.ServerManager: Registering server=hdp1156,60020,1321432190747, regionCount=0, userLoad=false 2011-11-16 09:34:35,646 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321432287936 2011-11-16 09:34:35,646 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 09:34:35,647 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned java.net.ConnectException: Connection refused for 1895561433 2011-11-16 09:37:35,648 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321432475647 2011-11-16 09:43:35,651 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 09:43:35,652 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned java.net.ConnectException: Connection refused for 1895561433 2011-11-16 09:46:35,653 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321433015652 2011-11-16 09:49:35,656 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 09:49:35,657 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned java.net.ConnectException: Connection refused for 1895561433 2011-11-16 09:49:36,305 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [hdp1105,60020,1321217721376] 2011-11-16 09:49:36,305 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for hdp1105,60020,1321217721376 2011-11-16 09:49:36,309 ERROR org.apache.hadoop.hbase.executor.EventHandler: Caught throwable while processing event M_SERVER_SHUTDOWN java.lang.NullPointerException at org.apache.hadoop.hbase.master.AssignmentManager.processServerShutdown(AssignmentManager.java:1781) at org.apache.hadoop.hbase.master.handler.ServerShutdownHandler.process(ServerShutdownHandler.java:101) at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:156) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:662) 2011-11-16 09:52:35,657 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321433375656 2011-11-16 09:52:35,657 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 09:52:35,658 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned java.net.ConnectException: Connection refused for 1895561433 2011-11-16 09:54:58,681 INFO org.apache.hadoop.hbase.master.HMaster: Balance=true 2011-11-16 09:55:35,659 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321433555658 2011-11-16 09:58:35,661 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 09:58:35,667 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for message,99086_700177012,1302074416350.1895561433 but we are not serving it for 1895561433 2011-11-16 10:01:35,663 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transition timed out: message,99086_700177012,1302074416350.1895561433 state=PENDING_CLOSE, ts=1321433915662 2011-11-16 10:01:35,663 INFO org.apache.hadoop.hbase.master.AssignmentManager: Region has been PENDING_CLOSE for too long, running forced unassign again on region=message,99086_700177012,1302074416350.1895561433 2011-11-16 10:01:35,668 INFO org.apache.hadoop.hbase.master.AssignmentManager: Server serverName=hdp1105,60020,1321217721376, load=(requests=114, regions=657, usedHeap=1380, maxHeap=11897) returned org.apache.hadoop.hbase.NotServingRegionException: org.apache.hadoop.hbase.NotServingRegionException: Received close for message,99086_700177012,1302074416350.1895561433 but we are not serving it for 1895561433 2011-11-16 10:02:35,647 INFO org.apache.hadoop.hbase.master.HMaster: Balance=true 2011-11-16 10:04:35,665 INFO org.apache.hadoop.hbase.master.AssignmentManager: Regions in transitio
|
|