|
lztaomin
2012-07-02, 08:30
Lars George
2012-07-02, 09:42
lztaomin
2012-07-05, 06:59
N Keywal
2012-07-05, 07:32
Pablo Musa
2012-07-05, 21:37
Dhaval Shah
2012-07-05, 23:54
Gaojinchao
2012-07-06, 00:37
Pablo Musa
2012-07-10, 14:28
Gopinathan A
2012-07-10, 15:03
Laxman
2012-07-11, 04:41
Pablo Musa
2012-07-11, 15:12
Laxman
2012-07-12, 04:22
Pablo Musa
2012-07-16, 22:11
Sandy Pratt
2012-07-16, 22:55
|
-
Hmaster and HRegionServer disappearance reason to asklztaomin 2012-07-02, 08:30
HI ALL
My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. Thanks The following is a log ABORTING region server serverName=datanode1,60020,1325326435553, load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): regionserver:60020-0x3488dec38a02b1 regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, aborting Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 2012-07-01 13:45:38,707 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for datanode1,60020,1325326435553 2012-07-01 13:45:38,756 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 hlog(s) in hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352, length=5671397 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 2012-07-01 13:45:39,766 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 2012-07-01 13:45:39,880 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 2012-07-01 13:45:39,925 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 ABORTING region server serverName=datanode2,60020,1325146199444, load=(requests=614, regions=189, usedHeap=3662, maxHeap=8165): regionserver:60020-0x3488dec38a0002 regionserver:60020-0x3488dec38a0002 received expired from ZooKeeper, aborting Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 2012-07-01 13:24:10,308 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341075090535 2012-07-01 13:24:10,918 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 21 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341078690560, length=11778108 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://namenode:9000/hbase/t_speakfor_relation_chapter/ffd2057b46da227e078c82ff43f0f9f2/recovered.edits/0000000000660951991 (wrote 8178 edits in 403ms) 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in -1268935 ms for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 2012-07-01 13:24:29,824 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Received exception accessing META during server shutdown of datanode1,60020,1325326435553, retrying META read org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running, aborting at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2408) at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionInfo(HRegionServer.java:1649) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) lztaomin
-
Re: Hmaster and HRegionServer disappearance reason to askLars George 2012-07-02, 09:42
Hi lztaomin,
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: > HI ALL > > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): regionserver:60020-0x3488dec38a02b1 regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:45:38,707 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for datanode1,60020,1325326435553 > 2012-07-01 13:45:38,756 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 hlog(s) in hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 > 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352, length=5671397 > 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 > 2012-07-01 13:45:39,766 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 > 2012-07-01 13:45:39,880 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 > 2012-07-01 13:45:39,925 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 > > ABORTING region server serverName=datanode2,60020,1325146199444, load=(requests=614, regions=189, usedHeap=3662, maxHeap=8165): regionserver:60020-0x3488dec38a0002 regionserver:60020-0x3488dec38a0002 received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:24:10,308 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341075090535 > 2012-07-01 13:24:10,918 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 21 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341078690560, length=11778108 > 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://namenode:9000/hbase/t_speakfor_relation_chapter/ffd2057b46da227e078c82ff43f0f9f2/recovered.edits/0000000000660951991 (wrote 8178 edits in 403ms)
-
Hmaster and HRegionServer disappearance reason to asklztaomin 2012-07-05, 06:59
HI ALL
My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. Thanks The following is a log ABORTING region server serverName=datanode1,60020,1325326435553, load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): regionserver:60020-0x3488dec38a02b1 regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, aborting Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 2012-07-01 13:45:38,707 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for datanode1,60020,1325326435553 2012-07-01 13:45:38,756 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 hlog(s) in hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352, length=5671397 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 2012-07-01 13:45:39,766 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 2012-07-01 13:45:39,880 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 2012-07-01 13:45:39,925 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 ABORTING region server serverName=datanode2,60020,1325146199444, load=(requests=614, regions=189, usedHeap=3662, maxHeap=8165): regionserver:60020-0x3488dec38a0002 regionserver:60020-0x3488dec38a0002 received expired from ZooKeeper, aborting Cause: org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) 2012-07-01 13:24:10,308 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341075090535 2012-07-01 13:24:10,918 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 21 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341078690560, length=11778108 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://namenode:9000/hbase/t_speakfor_relation_chapter/ffd2057b46da227e078c82ff43f0f9f2/recovered.edits/0000000000660951991 (wrote 8178 edits in 403ms) 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in -1268935 ms for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 2012-07-01 13:24:29,824 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Received exception accessing META during server shutdown of datanode1,60020,1325326435553, retrying META read org.apache.hadoop.ipc.RemoteException: java.io.IOException: Server not running, aborting at org.apache.hadoop.hbase.regionserver.HRegionServer.checkOpen(HRegionServer.java:2408) at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegionInfo(HRegionServer.java:1649) at sun.reflect.GeneratedMethodAccessor15.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:570) at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1039) lztaomin
-
Re: Hmaster and HRegionServer disappearance reason to askN Keywal 2012-07-05, 07:32
Hi,
It's a ZK expiry on sunday 1st. Root cause could be the leap second bug? N. On Thu, Jul 5, 2012 at 8:59 AM, lztaomin <[EMAIL PROTECTED]> wrote: > HI ALL > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): regionserver:60020-0x3488dec38a02b1 regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:45:38,707 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Splitting logs for datanode1,60020,1325326435553 > 2012-07-01 13:45:38,756 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 hlog(s) in hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 > 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 1 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352, length=5671397 > 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.util.FSUtils: Recovering file hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 > 2012-07-01 13:45:39,766 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341006689352 > 2012-07-01 13:45:39,880 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 > 2012-07-01 13:45:39,925 INFO org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Using syncFs -- HDFS-200 > > ABORTING region server serverName=datanode2,60020,1325146199444, load=(requests=614, regions=189, usedHeap=3662, maxHeap=8165): regionserver:60020-0x3488dec38a0002 regionserver:60020-0x3488dec38a0002 received expired from ZooKeeper, aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:343) > at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:261) > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:530) > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:24:10,308 INFO org.apache.hadoop.hbase.util.FSUtils: Finished lease recover attempt for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341075090535 > 2012-07-01 13:24:10,918 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog 21 of 32: hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanode1%3A60020.1341078690560, length=11778108 > 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Closed path hdfs://namenode:9000/hbase/t_speakfor_relation_chapter/ffd2057b46da227e078c82ff43f0f9f2/recovered.edits/0000000000660951991 (wrote 8178 edits in 403ms) > 2012-07-01 13:24:29,809 INFO org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: hlog file splitting completed in -1268935 ms for hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 > 2012-07-01 13:24:29,824 INFO org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: Received exception accessing META during server shutdown of datanode1,60020,1325326435553, retrying META read
-
RE: Hmaster and HRegionServer disappearance reason to askPablo Musa 2012-07-05, 21:37
I am having the same problem. I tried N different things but I cannot solve the problem.
hadoop-0.20.noarch 0.20.2+923.256-1 hadoop-hbase.noarch 0.90.6+84.29-1 hadoop-zookeeper.noarch 3.3.5+19.1-1 I already set: <property> <name>hbase.hregion.memstore.mslab.enabled</name> <value>true</value> </property> <property> <name>hbase.regionserver.handler.count</name> <value>20</value> </property> But it does not seem to work. How can I check if this variables are really set in the HRegionServer? I am starting the server with the following: -Xmx8192m -XX:NewSize=64m -XX:MaxNewSize=64m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps I am also having trouble to read reagionserver.out [GC 72004.406: [ParNew: 55830K->2763K(59008K), 0.0043820 secs] 886340K->835446K(1408788K) icms_dc=0 , 0.0044900 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] [GC 72166.759: [ParNew: 55192K->6528K(59008K), 135.1102750 secs] 887876K->839688K(1408788K) icms_dc=0 , 135.1103920 secs] [Times: user=1045.58 sys=138.11, real=135.09 secs] [GC 72552.616: [ParNew: 58977K->6528K(59008K), 0.0083040 secs] 892138K->847415K(1408788K) icms_dc=0 , 0.0084060 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] [GC 72882.991: [ParNew: 58979K->6528K(59008K), 151.4924490 secs] 899866K->853931K(1408788K) icms_dc=0 , 151.4925690 secs] [Times: user=0.07 sys=151.48, real=151.47 secs] What does each part means? Each line is a GC cicle? Thanks, Pablo -----Original Message----- From: Lars George [mailto:[EMAIL PROTECTED]] Sent: segunda-feira, 2 de julho de 2012 06:43 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Hi lztaomin, > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: > HI ALL > > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, > load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): > regionserver:60020-0x3488dec38a02b1 > regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, > aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(Zoo > KeeperWatcher.java:343) at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWa > tcher.java:261) at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.ja > va:530) at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:45:38,707 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: > Splitting logs for datanode1,60020,1325326435553 > 2012-07-01 13:45:38,756 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 > hlog(s) in > hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 > 2012-07-01 13:45:38,764 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog > 1 of 32: > hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553/datanod > e1%3A60020.1341006689352, length=5671397 > 2012-07-01 13:45:38,764 INFO org.apache.hadoop.hbase.util.FSUtils:
-
Re: Hmaster and HRegionServer disappearance reason to askDhaval Shah 2012-07-05, 23:54
Pablo, instead of CMSIncrementalMode try UseParNewGC.. That seemed to be the silver bullet when I was dealing with HBase region server crashes
Regards, Dhaval ________________________________ From: Pablo Musa <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Thursday, 5 July 2012 5:37 PM Subject: RE: Hmaster and HRegionServer disappearance reason to ask I am having the same problem. I tried N different things but I cannot solve the problem. hadoop-0.20.noarch 0.20.2+923.256-1 hadoop-hbase.noarch 0.90.6+84.29-1 hadoop-zookeeper.noarch 3.3.5+19.1-1 I already set: <property> <name>hbase.hregion.memstore.mslab.enabled</name> <value>true</value> </property> <property> <name>hbase.regionserver.handler.count</name> <value>20</value> </property> But it does not seem to work. How can I check if this variables are really set in the HRegionServer? I am starting the server with the following: -Xmx8192m -XX:NewSize=64m -XX:MaxNewSize=64m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps I am also having trouble to read reagionserver.out [GC 72004.406: [ParNew: 55830K->2763K(59008K), 0.0043820 secs] 886340K->835446K(1408788K) icms_dc=0 , 0.0044900 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] [GC 72166.759: [ParNew: 55192K->6528K(59008K), 135.1102750 secs] 887876K->839688K(1408788K) icms_dc=0 , 135.1103920 secs] [Times: user=1045.58 sys=138.11, real=135.09 secs] [GC 72552.616: [ParNew: 58977K->6528K(59008K), 0.0083040 secs] 892138K->847415K(1408788K) icms_dc=0 , 0.0084060 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] [GC 72882.991: [ParNew: 58979K->6528K(59008K), 151.4924490 secs] 899866K->853931K(1408788K) icms_dc=0 , 151.4925690 secs] [Times: user=0.07 sys=151.48, real=151.47 secs] What does each part means? Each line is a GC cicle? Thanks, Pablo -----Original Message----- From: Lars George [mailto:[EMAIL PROTECTED]] Sent: segunda-feira, 2 de julho de 2012 06:43 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Hi lztaomin, > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: > HI ALL > > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, > load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): > regionserver:60020-0x3488dec38a02b1 > regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, > aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(Zoo > KeeperWatcher.java:343) at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWa > tcher.java:261) at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.ja > va:530) at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:45:38,707 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: > Splitting logs for datanode1,60020,1325326435553 > 2012-07-01 13:45:38,756 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32
-
re: Hmaster and HRegionServer disappearance reason to askGaojinchao 2012-07-06, 00:37
Did you check "http://hbase.apache.org/book.html#perf.os.swap"?
-----邮件原件----- 发件人: Pablo Musa [mailto:[EMAIL PROTECTED]] 发送时间: 2012年7月6日 5:38 收件人: [EMAIL PROTECTED] 主题: RE: Hmaster and HRegionServer disappearance reason to ask I am having the same problem. I tried N different things but I cannot solve the problem. hadoop-0.20.noarch 0.20.2+923.256-1 hadoop-hbase.noarch 0.90.6+84.29-1 hadoop-zookeeper.noarch 3.3.5+19.1-1 I already set: <property> <name>hbase.hregion.memstore.mslab.enabled</name> <value>true</value> </property> <property> <name>hbase.regionserver.handler.count</name> <value>20</value> </property> But it does not seem to work. How can I check if this variables are really set in the HRegionServer? I am starting the server with the following: -Xmx8192m -XX:NewSize=64m -XX:MaxNewSize=64m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps I am also having trouble to read reagionserver.out [GC 72004.406: [ParNew: 55830K->2763K(59008K), 0.0043820 secs] 886340K->835446K(1408788K) icms_dc=0 , 0.0044900 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] [GC 72166.759: [ParNew: 55192K->6528K(59008K), 135.1102750 secs] 887876K->839688K(1408788K) icms_dc=0 , 135.1103920 secs] [Times: user=1045.58 sys=138.11, real=135.09 secs] [GC 72552.616: [ParNew: 58977K->6528K(59008K), 0.0083040 secs] 892138K->847415K(1408788K) icms_dc=0 , 0.0084060 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] [GC 72882.991: [ParNew: 58979K->6528K(59008K), 151.4924490 secs] 899866K->853931K(1408788K) icms_dc=0 , 151.4925690 secs] [Times: user=0.07 sys=151.48, real=151.47 secs] What does each part means? Each line is a GC cicle? Thanks, Pablo -----Original Message----- From: Lars George [mailto:[EMAIL PROTECTED]] Sent: segunda-feira, 2 de julho de 2012 06:43 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Hi lztaomin, > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: > HI ALL > > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, > load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): > regionserver:60020-0x3488dec38a02b1 > regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, > aborting > Cause: > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(Zoo > KeeperWatcher.java:343) at > org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWa > tcher.java:261) at > org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.ja > va:530) at > org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:506) > 2012-07-01 13:45:38,707 INFO > org.apache.hadoop.hbase.master.handler.ServerShutdownHandler: > Splitting logs for datanode1,60020,1325326435553 > 2012-07-01 13:45:38,756 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting 32 > hlog(s) in > hdfs://namenode:9000/hbase/.logs/datanode1,60020,1325326435553 > 2012-07-01 13:45:38,764 INFO > org.apache.hadoop.hbase.regionserver.wal.HLogSplitter: Splitting hlog
-
RE: Hmaster and HRegionServer disappearance reason to askPablo Musa 2012-07-10, 14:28
I tried to change the flag but yesterday it happened again:
Application time: 0.3025790 seconds 30013.866: [GC 30013.866: [ParNew: 106069K->989K(118016K), 178.8437590 secs] 473853K->369013K(1275392K), 178.8438570 secs] [Times: user=0.05 sys=178.82, real=178.81 secs] Total time for which application threads were stopped: 178.8441500 seconds I also checked the possibility of a swap, but I don't think it is the problem as vmstat always show clean swap. Help guys, please :) Abs, Pablo -----Original Message----- From: Dhaval Shah [mailto:[EMAIL PROTECTED]] Sent: quinta-feira, 5 de julho de 2012 20:55 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Pablo, instead of CMSIncrementalMode try UseParNewGC.. That seemed to be the silver bullet when I was dealing with HBase region server crashes Regards, Dhaval ________________________________ From: Pablo Musa <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Thursday, 5 July 2012 5:37 PM Subject: RE: Hmaster and HRegionServer disappearance reason to ask I am having the same problem. I tried N different things but I cannot solve the problem. hadoop-0.20.noarch 0.20.2+923.256-1 hadoop-hbase.noarch 0.90.6+84.29-1 hadoop-zookeeper.noarch 3.3.5+19.1-1 I already set: <property> <name>hbase.hregion.memstore.mslab.enabled</name> <value>true</value> </property> <property> <name>hbase.regionserver.handler.count</name> <value>20</value> </property> But it does not seem to work. How can I check if this variables are really set in the HRegionServer? I am starting the server with the following: -Xmx8192m -XX:NewSize=64m -XX:MaxNewSize=64m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps I am also having trouble to read reagionserver.out [GC 72004.406: [ParNew: 55830K->2763K(59008K), 0.0043820 secs] 886340K->835446K(1408788K) icms_dc=0 , 0.0044900 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] [GC 72166.759: [ParNew: 55192K->6528K(59008K), 135.1102750 secs] 887876K->839688K(1408788K) icms_dc=0 , 135.1103920 secs] [Times: user=1045.58 sys=138.11, real=135.09 secs] [GC 72552.616: [ParNew: 58977K->6528K(59008K), 0.0083040 secs] 892138K->847415K(1408788K) icms_dc=0 , 0.0084060 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] [GC 72882.991: [ParNew: 58979K->6528K(59008K), 151.4924490 secs] 899866K->853931K(1408788K) icms_dc=0 , 151.4925690 secs] [Times: user=0.07 sys=151.48, real=151.47 secs] What does each part means? Each line is a GC cicle? Thanks, Pablo -----Original Message----- From: Lars George [mailto:[EMAIL PROTECTED]] Sent: segunda-feira, 2 de julho de 2012 06:43 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Hi lztaomin, > org.apache.zookeeper.KeeperException$SessionExpiredException: > KeeperErrorCode = Session expired indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: > HI ALL > > My HBase group a total of 3 machine, Hadoop HBase mounted in the same machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me. > Thanks > > The following is a log > > ABORTING region server serverName=datanode1,60020,1325326435553, > load=(requests=332, regions=188, usedHeap=2741, maxHeap=8165): > regionserver:60020-0x3488dec38a02b1 > regionserver:60020-0x3488dec38a02b1 received expired from ZooKeeper, > aborting
-
RE: Hmaster and HRegionServer disappearance reason to askGopinathan A 2012-07-10, 15:03
I doubt on your system load, because normally GC will not take more time to
collect from 473MB (used heap), here allocated heap just gone upto 1.2GB. Can u check system load factor from the top command & % system wait. What is your system configuration? Thanks & Regards, Gopinathan A **************************************************************************** *********** This e-mail and attachments contain confidential information from HUAWEI, which is intended only for the person or entity whose address is listed above. Any use of the information contained herein in any way (including, but not limited to, total or partial disclosure, reproduction, or dissemination) by persons other than the intended recipient's) is prohibited. If you receive this e-mail in error, please notify the sender by phone or email immediately and delete it! -----Original Message----- From: Pablo Musa [mailto:[EMAIL PROTECTED]] Sent: Tuesday, July 10, 2012 7:58 PM To: [EMAIL PROTECTED] Subject: RE: Hmaster and HRegionServer disappearance reason to ask I tried to change the flag but yesterday it happened again: Application time: 0.3025790 seconds 30013.866: [GC 30013.866: [ParNew: 106069K->989K(118016K), 178.8437590 secs] 473853K->369013K(1275392K), 178.8438570 secs] [Times: user=0.05 sys=178.82, real=178.81 secs] Total time for which application threads were stopped: 178.8441500 seconds I also checked the possibility of a swap, but I don't think it is the problem as vmstat always show clean swap. Help guys, please :) Abs, Pablo -----Original Message----- From: Dhaval Shah [mailto:[EMAIL PROTECTED]] Sent: quinta-feira, 5 de julho de 2012 20:55 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Pablo, instead of CMSIncrementalMode try UseParNewGC.. That seemed to be the silver bullet when I was dealing with HBase region server crashes Regards, Dhaval ________________________________ From: Pablo Musa <[EMAIL PROTECTED]> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Thursday, 5 July 2012 5:37 PM Subject: RE: Hmaster and HRegionServer disappearance reason to ask I am having the same problem. I tried N different things but I cannot solve the problem. hadoop-0.20.noarch 0.20.2+923.256-1 hadoop-hbase.noarch 0.90.6+84.29-1 hadoop-zookeeper.noarch 3.3.5+19.1-1 I already set: <property> <name>hbase.hregion.memstore.mslab.enabled</name> <value>true</value> </property> <property> <name>hbase.regionserver.handler.count</name> <value>20</value> </property> But it does not seem to work. How can I check if this variables are really set in the HRegionServer? I am starting the server with the following: -Xmx8192m -XX:NewSize=64m -XX:MaxNewSize=64m -ea -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode -XX:+UseConcMarkSweepGC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps I am also having trouble to read reagionserver.out [GC 72004.406: [ParNew: 55830K->2763K(59008K), 0.0043820 secs] 886340K->835446K(1408788K) icms_dc=0 , 0.0044900 secs] [Times: user=0.04 sys=0.00, real=0.00 secs] [GC 72166.759: [ParNew: 55192K->6528K(59008K), 135.1102750 secs] 887876K->839688K(1408788K) icms_dc=0 , 135.1103920 secs] [Times: user=1045.58 sys=138.11, real=135.09 secs] [GC 72552.616: [ParNew: 58977K->6528K(59008K), 0.0083040 secs] 892138K->847415K(1408788K) icms_dc=0 , 0.0084060 secs] [Times: user=0.05 sys=0.01, real=0.01 secs] [GC 72882.991: [ParNew: 58979K->6528K(59008K), 151.4924490 secs] 899866K->853931K(1408788K) icms_dc=0 , 151.4925690 secs] [Times: user=0.07 sys=151.48, real=151.47 secs] What does each part means? Each line is a GC cicle? Thanks, Pablo -----Original Message----- From: Lars George [mailto:[EMAIL PROTECTED]] Sent: segunda-feira, 2 de julho de 2012 06:43 To: [EMAIL PROTECTED] Subject: Re: Hmaster and HRegionServer disappearance reason to ask Hi lztaomin, > org.apache.zookeeper.KeeperException$SessionExpiredException: indicates that you have experienced the "Juliet Pause" issue, which means you ran into a JVM garbage collection that lasted longer than the configured ZooKeeper timeout threshold. If you search for it on Google http://www.google.com/search?q=juliet+pause+hbase you will find quite a few pages explaining the problem, and what you can do to avoid this. Lars On Jul 2, 2012, at 10:30 AM, lztaomin wrote: machine, zookeeper using HBase own. Operation 3 months after the reported abnormal as follows. Cause hmaster and HRegionServer processes are gone. Please help me.
-
RE: Hmaster and HRegionServer disappearance reason to askLaxman 2012-07-11, 04:41
IIUC, this seems to be a problem with Direct memory cleanup.
I've seen this problem mostly with application using lot of NIO (RegionServer, DataNode). Symptoms of this problem 1) Allocated heap is 1 GB 2) VIRT/RES will occupy 10GB or even more 3) GC logs says, it takes more than 1 minute to clean up where allocated heap is just few GB (<4GB) 4) Your application uses lot of NIO/RMI calls(DataNode, RegionServer) How to solve? 1) Fix the direct memory usage to a fixed value -XX:MaxDirectMemorySize=1G 2) DON'T set -XX:+DisableExplicitGC Recent issue which is relevant to this HDFS-3600. Please let us know your findings. -- Regards, Laxman > -----Original Message----- > From: Pablo Musa [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, July 10, 2012 7:58 PM > To: [EMAIL PROTECTED] > Subject: RE: Hmaster and HRegionServer disappearance reason to ask > > I tried to change the flag but yesterday it happened again: > > Application time: 0.3025790 seconds > 30013.866: [GC 30013.866: [ParNew: 106069K->989K(118016K), 178.8437590 > secs] 473853K->369013K(1275392K), 178.8438570 secs] [Times: user=0.05 > sys=178.82, real=178.81 secs] > Total time for which application threads were stopped: 178.8441500 > seconds > > I also checked the possibility of a swap, but I don't think it is the > problem as vmstat always show clean swap. > > Help guys, please :) > > Abs, > Pablo
-
RE: Hmaster and HRegionServer disappearance reason to askPablo Musa 2012-07-11, 15:12
> 1) Fix the direct memory usage to a fixed value -XX:MaxDirectMemorySize=1G
This flag should be in RS ou DN? Thanks, Pablo -----Original Message----- From: Laxman [mailto:[EMAIL PROTECTED]] Sent: quarta-feira, 11 de julho de 2012 01:41 To: [EMAIL PROTECTED]; Pablo Musa Subject: RE: Hmaster and HRegionServer disappearance reason to ask IIUC, this seems to be a problem with Direct memory cleanup. I've seen this problem mostly with application using lot of NIO (RegionServer, DataNode). Symptoms of this problem 1) Allocated heap is 1 GB 2) VIRT/RES will occupy 10GB or even more 3) GC logs says, it takes more than 1 minute to clean up where allocated heap is just few GB (<4GB) 4) Your application uses lot of NIO/RMI calls(DataNode, RegionServer) How to solve? 1) Fix the direct memory usage to a fixed value -XX:MaxDirectMemorySize=1G 2) DON'T set -XX:+DisableExplicitGC Recent issue which is relevant to this HDFS-3600. Please let us know your findings. -- Regards, Laxman > -----Original Message----- > From: Pablo Musa [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, July 10, 2012 7:58 PM > To: [EMAIL PROTECTED] > Subject: RE: Hmaster and HRegionServer disappearance reason to ask > > I tried to change the flag but yesterday it happened again: > > Application time: 0.3025790 seconds > 30013.866: [GC 30013.866: [ParNew: 106069K->989K(118016K), 178.8437590 > secs] 473853K->369013K(1275392K), 178.8438570 secs] [Times: user=0.05 > sys=178.82, real=178.81 secs] Total time for which application threads > were stopped: 178.8441500 seconds > > I also checked the possibility of a swap, but I don't think it is the > problem as vmstat always show clean swap. > > Help guys, please :) > > Abs, > Pablo
-
RE: Hmaster and HRegionServer disappearance reason to askLaxman 2012-07-12, 04:22
> > 1) Fix the direct memory usage to a fixed value -
> XX:MaxDirectMemorySize=1G > > This flag should be in RS ou DN? We need to apply for both but limit can be increased based on your load (May be 2G). Also we can to apply to all processes which are having following symptoms. 1) Allocated heap is few GB (4 to 8) 2) VIRT/RES will occupy double the heap (like 15GB) or even more 3) Long pauses in GC log (allocated heap is just <=8GB) 4) Your application uses lot of NIO/RMI calls(Ex: DataNode, RegionServer) In our cluster we apply for all server processes (NN, DN, HM, RS, JT, TT, ZooKeeper). Long pauses are disappeared after we set this flag (esp. for DN and RS). -- Regards, Laxman
-
RE: Hmaster and HRegionServer disappearance reason to askPablo Musa 2012-07-16, 22:11
Hey guys,
thanks for help but I am stuck. I tried changing the GC: " instead of CMSIncrementalMode try UseParNewGC" Also checked for swap, which in vmstat is always zero and analizying top is not an option. Load factor never gets higher than 10.0 in a 16 cpu and usually it I around 1.5. Finally, I tried the "-XX:MaxDirectMemorySize=2G" in the datanode, but nothing changed. Datanode still has a lot of the following errors and RS keep falling 3 times a day after GC timeout: 2012-07-16 10:13:13,362 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.17.2.22:50010, storageID=DS-554036718-127.0.0.1-50010-1318903052632, infoPort=50075, ipcPort=50020):DataXceiver java.io.EOFException: while trying to read 65557 bytes at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:290) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:334) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:398) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:577) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:494) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:183) ----------------------------------------------------------------------- 2012-07-16 10:14:25,583 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(172.17.2.22:50010, storageID=DS-554036718-127.0.0.1-50010-1318903052632, infoPort=50075, ipcPort=50020):DataXceiver java.net.SocketTimeoutException: 480000 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.17.2.22:50010 remote=/172.17.2.22:49590] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397) at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175) I have not tried the flag in the RS, but I really want to solve the DN problems above!! Guys, do you have any idea? Thanks, Pablo -----Original Message----- From: Laxman [mailto:[EMAIL PROTECTED]] Sent: quinta-feira, 12 de julho de 2012 01:22 To: Pablo Musa; [EMAIL PROTECTED] Subject: RE: Hmaster and HRegionServer disappearance reason to ask > > 1) Fix the direct memory usage to a fixed value - > XX:MaxDirectMemorySize=1G > > This flag should be in RS ou DN? We need to apply for both but limit can be increased based on your load (May be 2G). Also we can to apply to all processes which are having following symptoms. 1) Allocated heap is few GB (4 to 8) 2) VIRT/RES will occupy double the heap (like 15GB) or even more 3) Long pauses in GC log (allocated heap is just <=8GB) 4) Your application uses lot of NIO/RMI calls(Ex: DataNode, RegionServer) In our cluster we apply for all server processes (NN, DN, HM, RS, JT, TT, ZooKeeper). Long pauses are disappeared after we set this flag (esp. for DN and RS). -- Regards, Laxman
-
RE: Hmaster and HRegionServer disappearance reason to askSandy Pratt 2012-07-16, 22:55
This sounds similar to something I've seen before, but in that case I found the winning GC arguments to be something like
-XX:+UseParallelGC -XX:+UseParallelOldGC -XX:MaxDirectMemorySize=128m (note the old gen parallel compacting collector rather than the ParNew collector which IIRC is used with concurrent GC by default) I don't recall the MaxDirectMemorySize on its own preventing massive off-heap memory allocations from piling up. Just my 2 cents, YMMV. Sandy -----Original Message----- From: Laxman [mailto:[EMAIL PROTECTED]] Sent: Wednesday, July 11, 2012 9:22 PM To: 'Pablo Musa'; [EMAIL PROTECTED] Subject: RE: Hmaster and HRegionServer disappearance reason to ask > > 1) Fix the direct memory usage to a fixed value - > XX:MaxDirectMemorySize=1G > > This flag should be in RS ou DN? We need to apply for both but limit can be increased based on your load (May be 2G). Also we can to apply to all processes which are having following symptoms. 1) Allocated heap is few GB (4 to 8) 2) VIRT/RES will occupy double the heap (like 15GB) or even more 3) Long pauses in GC log (allocated heap is just <=8GB) 4) Your application uses lot of NIO/RMI calls(Ex: DataNode, RegionServer) In our cluster we apply for all server processes (NN, DN, HM, RS, JT, TT, ZooKeeper). Long pauses are disappeared after we set this flag (esp. for DN and RS). -- Regards, Laxman |