|
Marco Gallotta
2012-12-29, 11:44
Marco Gallotta
2012-12-29, 17:22
Stack
2012-12-29, 23:07
Marco Gallotta
2012-12-30, 11:44
Dhaval Shah
2012-12-29, 13:15
|
-
Hbase master failing to start after reaching 95% disk and expanding clusterMarco Gallotta 2012-12-29, 11:44
Hi there
I've been running an hbase cluster for several months, and it recently experienced problems as the nodes reached 95% disk capacity. I added an extra node, and now the master keeps crashing with the errors below. I also increased the disk capacity on each individual node after this, and the errors are the same. I tried removing the new node, and that doesn't help. There are similar errors in the regionserver and zookeeper logs, but the all seem to echo from the master logs. Anything I can look at to help diagnose what the problem here is? hbase-root-master-analytics.log: Sat Dec 29 03:14:22 PST 2012 Starting master on analytics core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 59480 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 59480 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase 2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase 2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase 2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase 2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries 2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740) at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146) at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.hbase.util.ServerCommandLine.doMain(ServerCommandLine.java:76) at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1754) Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) at org.apache.zookeeper.KeeperException.create(KeeperException.java:51) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021) at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1049) at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:176) at org.apache.hadoop.hbase.zookeeper.ZKUtil.createAndFailSilent(ZKUtil.java:896) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.createBaseZNodes(ZooKeeperWatcher.java:161) at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.<init>(ZooKeeperWatcher.java:154) at org.apache.hadoop.hbase.master.HMaster.<init>(HMaster.java:281) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:532) at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1735) ... 5 more Marco Gallotta | Mountain View, California Software Engineer, Infrastructure | Loki Studios fb.me/marco.gallotta | twitter.com/marcog [EMAIL PROTECTED] | +1 (650) 417-3313 Sent with Sparrow (http://www.sparrowmailapp.com/?sig) +
Marco Gallotta 2012-12-29, 11:44
-
Re: Hbase master failing to start after reaching 95% disk and expanding clusterMarco Gallotta 2012-12-29, 17:22
I also see a number of these warnings in the zookeeper logs, which look quite telling. Zookeeper is running on the slaves in question, and port 3888 is unblocked in the firewall.
2012-12-29 07:23:42,492 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager : Cannot open channel to 1 at election address slave1.analytics-internal.lokistu dios.com/10.171.98.247:3888 java.net.ConnectException: Connection refused at java.net.PlainSocketImpl.socketConnect(Native Method) at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.ja va:327) at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocket Impl.java:193) at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java :180) at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:384) at java.net.Socket.connect(Socket.java:546) at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(Quorum CnxManager.java:354) at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxM anager.java:327) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$Worke rSender.process(FastLeaderElection.java:393) at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365) at java.lang.Thread.run(Thread.java:679) -- Marco Gallotta | Mountain View, California Software Engineer, Infrastructure | Loki Studios fb.me/marco.gallotta | twitter.com/marcog [EMAIL PROTECTED] | +1 (650) 417-3313 Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Saturday 29 December 2012 at 1:44 PM, Marco Gallotta wrote: > Hi there > > I've been running an hbase cluster for several months, and it recently experienced problems as the nodes reached 95% disk capacity. I added an extra node, and now the master keeps crashing with the errors below. I also increased the disk capacity on each individual node after this, and the errors are the same. I tried removing the new node, and that doesn't help. > > There are similar errors in the regionserver and zookeeper logs, but the all seem to echo from the master logs. > > Anything I can look at to help diagnose what the problem here is? > > hbase-root-master-analytics.log: > Sat Dec 29 03:14:22 PST 2012 Starting master on analytics > core file size (blocks, -c) 0 > data seg size (kbytes, -d) unlimited > scheduling priority (-e) 0 > file size (blocks, -f) unlimited > pending signals (-i) 59480 > max locked memory (kbytes, -l) 64 > max memory size (kbytes, -m) unlimited > open files (-n) 1024 > pipe size (512 bytes, -p) 8 > POSIX message queues (bytes, -q) 819200 > real-time priority (-r) 0 > stack size (kbytes, -s) 8192 > cpu time (seconds, -t) unlimited > max user processes (-u) 59480 > virtual memory (kbytes, -v) unlimited > file locks (-x) unlimited > 2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 > 2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000 > 2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase +
Marco Gallotta 2012-12-29, 17:22
-
Re: Hbase master failing to start after reaching 95% disk and expanding clusterStack 2012-12-29, 23:07
On Sat, Dec 29, 2012 at 9:22 AM, Marco Gallotta <[EMAIL PROTECTED]>wrote:
> I also see a number of these warnings in the zookeeper logs, which look > quite telling. Zookeeper is running on the slaves in question, and port > 3888 is unblocked in the firewall. > > 2012-12-29 07:23:42,492 WARN > org.apache.zookeeper.server.quorum.QuorumCnxManager > : Cannot open channel to 1 at election address > slave1.analytics-internal.lokistu > dios.com/10.171.98.247:3888 > java.net.ConnectException: Connection refused > Is this server up and running. Connection refused is pretty plain speaking. Can you ping this server? Can you telnet to the server on the above port? What is in the logs for the server at 98.247? It seems like HBase Master is failing because it cannot connect to zk. St.Ack +
Stack 2012-12-29, 23:07
-
Re: Hbase master failing to start after reaching 95% disk and expanding clusterMarco Gallotta 2012-12-30, 11:44
On Sunday 30 December 2012 at 1:07 AM, Stack wrote:
> On Sat, Dec 29, 2012 at 9:22 AM, Marco Gallotta <[EMAIL PROTECTED] (mailto:[EMAIL PROTECTED])>wrote: > > > I also see a number of these warnings in the zookeeper logs, which look > > quite telling. Zookeeper is running on the slaves in question, and port > > 3888 is unblocked in the firewall. > > > > 2012-12-29 07:23:42,492 WARN > > org.apache.zookeeper.server.quorum.QuorumCnxManager > > : Cannot open channel to 1 at election address > > slave1.analytics-internal.lokistu > > dios.com/10.171.98.247:3888 (http://dios.com/10.171.98.247:3888) > > java.net.ConnectException: Connection refused > > > > > > Is this server up and running. Connection refused is pretty plain > speaking. Can you ping this server? Can you telnet to the server on the > above port? What is in the logs for the server at 98.247? > > It seems like HBase Master is failing because it cannot connect to zk. > > St.Ack Got it working, after making several changes and I'm not quite sure which fixed it. The servers were all running, and I was able to telnet into the server on 3888. -- Marco Gallotta | Mountain View, California Software Engineer, Infrastructure | Loki Studios fb.me/marco.gallotta | twitter.com/marcog [EMAIL PROTECTED] | +1 (650) 417-3313 Sent with Sparrow (http://www.sparrowmailapp.com/?sig) +
Marco Gallotta 2012-12-30, 11:44
-
Re:: Hbase master failing to start after reaching 95% disk and expanding clusterDhaval Shah 2012-12-29, 13:15
Seems like your zookeeper quorum is in a bad state (possibly because it ran out of disk in the past). Can you try stopping hbase, removing the zookeeper znode for hbase and then starting hbase. ------------------------------ On Sat 29 Dec, 2012 5:14 PM IST Marco Gallotta wrote: >Hi there > >I've been running an hbase cluster for several months, and it recently experienced problems as the nodes reached 95% disk capacity. I added an extra node, and now the master keeps crashing with the errors below. I also increased the disk capacity on each individual node after this, and the errors are the same. I tried removing the new node, and that doesn't help. > >There are similar errors in the regionserver and zookeeper logs, but the all seem to echo from the master logs. > >Anything I can look at to help diagnose what the problem here is? > >hbase-root-master-analytics.log: >Sat Dec 29 03:14:22 PST 2012 Starting master on analytics >core file size (blocks, -c) 0 >data seg size (kbytes, -d) unlimited >scheduling priority (-e) 0 >file size (blocks, -f) unlimited >pending signals (-i) 59480 >max locked memory (kbytes, -l) 64 >max memory size (kbytes, -m) unlimited >open files (-n) 1024 >pipe size (512 bytes, -p) 8 >POSIX message queues (bytes, -q) 819200 >real-time priority (-r) 0 >stack size (kbytes, -s) 8192 >cpu time (seconds, -t) unlimited >max user processes (-u) 59480 >virtual memory (kbytes, -v) unlimited >file locks (-x) unlimited >2012-12-29 03:14:24,601 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,614 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,622 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,631 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,636 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,643 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,651 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,665 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,675 INFO org.apache.hadoop.ipc.HBaseServer: Starting Thread-2 >2012-12-29 03:14:24,698 INFO org.apache.hadoop.ipc.HBaseServer: Starting IPC Server listener on 60000 >2012-12-29 03:14:25,322 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase >2012-12-29 03:14:28,735 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase > >2012-12-29 03:14:32,797 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase >2012-12-29 03:14:41,427 WARN org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: Possibly transient ZooKeeper exception: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase >2012-12-29 03:14:41,427 ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper: ZooKeeper exists failed after 3 retries >2012-12-29 03:14:41,428 ERROR org.apache.hadoop.hbase.master.HMasterCommandLine: Failed to start master >java.lang.RuntimeException: Failed construction of Master: class org.apache.hadoop.hbase.master.HMaster > > at org.apache.hadoop.hbase.master.HMaster.constructMaster(HMaster.java:1740) > at org.apache.hadoop.hbase.master.HMasterCommandLine.startMaster(HMasterCommandLine.java:146) > at org.apache.hadoop.hbase.master.HMasterCommandLine.run(HMasterCommandLine.java:103) +
Dhaval Shah 2012-12-29, 13:15
|