|
Garrett Wu
2011-04-29, 04:48
Ted Dunning
2011-04-29, 06:35
Michel Segel
2011-04-29, 11:36
Stack
2011-04-29, 16:13
Garrett Wu
2011-04-29, 22:12
|
-
debugging a possible long garbage collectionGarrett Wu 2011-04-29, 04:48
Our HBase master and region servers all came down during a quiet period
today (no read or write traffic). After inspecting the logs, I think I've pieced together what might have happened: 1. One of the region servers (hadoop11) timed out and started closing it's regions and aborting. 2. The master started splitting logs for the hadoop11 region. 3. While working on the first HLog, something causes the master to hang for 4.5 minutes. 4. This was longer than the 180-second default zookeeper timeout, so the master receives a zk session expired. 5. The master aborts. Some snippets from the logs are pasted below. Does anyone know what may have caused this? Was the hang really a garbage collection? How might I prevent this from happening again? Thanks, Garrett // From the master (hadoop01) 2011-04-28 07:59:24,004 INFO org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer ephemeral node deleted, processing expiration [hadoop11,60020,1302898865706] 2011-04-28 07:59:24,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: Added=hadoop11,60020,1302898865706 to dead servers, submitted shutdown handler to be executed, root=false, meta=false 2011-04-28 07:59:24,005 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for hadoop11,60020,1302898865706 [ ... ] 2011-04-28 07:59:24,984 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=hdfs://hadoop01/hbase/processed/03819443e22488ec15418f22c67f02f5/recovered.edits/0000000000060056396 region=03819443e22488ec15418f22c67f02f5 2011-04-28 07:59:25,101 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 2011-04-28 07:59:25,101 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer path=hdfs://hadoop01/hbase/processed/c56e10565f9e97fd19fb0879f59fd8b0/recovered.edits/0000000000060060517 region=c56e10565f9e97fd19fb0879f59fd8b0 [ // oh no! a 3 minute gap! ] 2011-04-28 08:03:58,571 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=77384 entries from hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174692580 2011-04-28 08:03:58,572 DEBUG org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 2 of 29: hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174703672, length=63864464 [ ... ] 2011-04-28 08:03:59,055 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 273689ms instead of 1000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 [ ... ] 2011-04-28 08:04:00,261 FATAL org.apache.hadoop.hbase.master.HMaster: master:60000-0x12f4c26d7652910 master:60000-0x12f4c26d7652910 received expired from ZooKeeper, aborting org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 2011-04-28 08:04:00,261 INFO org.apache.hadoop.hbase.master.HMaster: Aborting // From the regionserver (hadoop11) 2011-04-28 07:41:06,851 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB, free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675, hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675, cachingHitsRatio=4.29%%, evictions=528, evicted=957896, evictedPerRun=1814.197021484375 2011-04-28 07:46:06,851 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB, free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675, hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675, cachingHitsRatio=4.29%%, evictions=528, evicted=957896, evictedPerRun=1814.197021484375 2011-04-28 07:51:06,852 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB, free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675, hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675, cachingHitsRatio=4.29%%, evictions=528, evicted=957896, evictedPerRun=1814.197021484375 2011-04-28 07:59:39,198 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 285991ms for sessionid 0x12f4c26d7652917, closing socket connection and attempting reconnect 2011-04-28 07:59:40,600 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server hadoop01/172.30.160.15:2181 2011-04-28 07:59:40,601 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to hadoop01/172.30.160.15:2181, initiating session 2011-04-28 07:59:40,602 DEBUG org.apache.hadoop.hbase.regionserver.LogRoller: Hlog roll period 3600000ms elapsed 2011-04-28 07:59:40,652 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server hadoop01/172.30.160.15:2181, sessionid 0x12f4c26d7652917, negotiated timeout = 40000 2011-04-28 07:59:41,288 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 285993ms for sessionid 0x12f4c26d7652920, closing socket connection and attempting reconnect [ ... ] 2011-04-28 07:59:23,174 INFO org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation: Reconnected successfully. This disconnect could have been caused by a network partition or a long-running GC pause, either way it's recommended that you verify your environment. 2011-04-28 07:59:23,174 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down 2011-04-28 07:59:24,004 INFO org.apache.zookeeper.ClientCnxn: Unable to read additional data from server sessionid 0x12f4c26d7652920, likely server has closed socket, closing socket connection and attempting reconnect 2011-04-28 07:59:25
-
Re: debugging a possible long garbage collectionTed Dunning 2011-04-29, 06:35
Swap and gc are the usual culprits for this.
Are you running a recent enough version to have Todd's wondrous mslab option? On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu <[EMAIL PROTECTED]> wrote: > Some snippets from the logs are pasted below. Does anyone know what may > have caused this? Was the hang really a garbage collection? How might I > prevent this from happening again? >
-
Re: debugging a possible long garbage collectionMichel Segel 2011-04-29, 11:36
+1 on Todd's GC stuff. I just implemented it and it looks like there's a bit of improvement. I'm still early on in my testing...
Sent from a remote device. Please excuse any typos... Mike Segel On Apr 29, 2011, at 1:35 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Swap and gc are the usual culprits for this. > > Are you running a recent enough version to have Todd's wondrous mslab > option? > > On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu <[EMAIL PROTECTED]> wrote: > >> Some snippets from the logs are pasted below. Does anyone know what may >> have caused this? Was the hang really a garbage collection? How might I >> prevent this from happening again? >>
-
Re: debugging a possible long garbage collectionStack 2011-04-29, 16:13
On Thu, Apr 28, 2011 at 9:48 PM, Garrett Wu <[EMAIL PROTECTED]> wrote:
> 2. The master started splitting logs for the hadoop11 region. Splitting logs is a pretty intense operation. It could have run up your master heap and CPU such that it brought on a long GC pause. Look for a big gap in your logging just before the zk session timeout (In 0.92.0, log splitting is distributed across the cluster; the master does nought but orchestrate the process). In you regionserver node log, there does seem to a big pause just before the zk session timeout. Similarily in the master (at about the point where you've inserted the 'oh no'). For more on what Ted and Michel are referring to, see http://hbase.apache.org/book.html#jvm St.Ack > Some snippets from the logs are pasted below. Does anyone know what may > have caused this? Was the hang really a garbage collection? How might I > prevent this from happening again? > > Thanks, > Garrett > > > // From the master (hadoop01) > 2011-04-28 07:59:24,004 INFO > org.apache.hadoop.hbase.zookeeper.RegionServerTracker: RegionServer > ephemeral node deleted, processing expiration [hadoop11,60020,1302898865706] > 2011-04-28 07:59:24,005 DEBUG org.apache.hadoop.hbase.master.ServerManager: > Added=hadoop11,60020,1302898865706 to dead servers, submitted shutdown handler > to be executed, root=false, meta=false > 2011-04-28 07:59:24,005 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs > for hadoop11,60020,1302898865706 > [ ... ] > 2011-04-28 07:59:24,984 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer > path=hdfs://hadoop01/hbase/processed/03819443e22488ec15418f22c67f02f5/recovered.edits/0000000000060056396 > region=03819443e22488ec15418f22c67f02f5 > 2011-04-28 07:59:25,101 INFO > org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs > -- HDFS-200 > 2011-04-28 07:59:25,101 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Creating writer > path=hdfs://hadoop01/hbase/processed/c56e10565f9e97fd19fb0879f59fd8b0/recovered.edits/0000000000060060517 > region=c56e10565f9e97fd19fb0879f59fd8b0 > [ // oh no! a 3 minute gap! ] > 2011-04-28 08:03:58,571 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Pushed=77384 entries > from > hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174692580 > 2011-04-28 08:03:58,572 DEBUG > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 2 of > 29: > hdfs://hadoop01/hbase/.logs/hadoop11,60020,1302898865706/hadoop11%3A60020.1303174703672, > length=63864464 > [ ... ] > 2011-04-28 08:03:59,055 WARN org.apache.hadoop.hbase.util.Sleeper: We slept > 273689ms instead of 1000ms, this is likely due to a long garbage collecting > pause and it's usually bad, see > http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A9 > [ ... ] > 2011-04-28 08:04:00,261 FATAL org.apache.hadoop.hbase.master.HMaster: > master:60000-0x12f4c26d7652910 master:60000-0x12f4c26d7652910 received > expired from ZooKeeper, aborting > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:328) > at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:246) > at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2011-04-28 08:04:00,261 INFO org.apache.hadoop.hbase.master.HMaster: > Aborting > > // From the regionserver (hadoop11) > 2011-04-28 07:41:06,851 DEBUG > org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=1013.02 MB, > free=210.38 MB, max=1.19 GB, blocks=14843, accesses=5097608, hits=43675, > hitRatio=0.85%%, cachingAccesses=1016414, cachingHits=43675, > cachingHitsRatio=4.29%%, evictions=528, evicted=957896, > evictedPerRun=1814.197021484375
-
Re: debugging a possible long garbage collectionGarrett Wu 2011-04-29, 22:12
Thanks for the info guys, it was all very helpful.
I do have the mslab option enabled, but it looks like this was just a gc triggered by the load of log splitting as Stack points out. Glad to see distributed log splitting in 0.92. Garrett |