NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB

Zookeeper >> mail # dev >> Zookeeper service is down when Leader disk is full


RE: Zookeeper service is down when Leader disk is full
We have analyzed the exit patterns of all the other request processors and
could not find this pattern in any of them.

We found that this was introduced as part of ZOOKEEPER-121.
Calling System.exit and Thread.join on the same thread is causing this hang.
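One hedged way to avoid this kind of hang, sketched under the assumption that the failing thread may hand the exit call to a helper thread instead of making it itself: the failing thread then terminates on its own, so a shutdown hook that joins it can complete. HandOffExitWorker, exitAction, commit and hookCompletes are hypothetical names, System.exit is replaced by a stub so the sketch is safe to run, and this is not necessarily the approach taken in ZOOKEEPER-1109.

```java
import java.io.IOException;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.IntConsumer;

// Hypothetical stand-in for SyncRequestProcessor. On a fatal I/O error it hands
// the exit call to a separate thread and returns, instead of blocking inside it.
final class HandOffExitWorker extends Thread {
    private final IntConsumer exitAction; // System::exit in a real server; a stub here

    HandOffExitWorker(IntConsumer exitAction) {
        super("SyncThread");
        this.exitAction = exitAction;
    }

    @Override
    public void run() {
        try {
            commit();
        } catch (IOException e) {
            // Exit from another thread; this thread is then free to terminate.
            new Thread(() -> exitAction.accept(11), "exit-handoff").start();
        }
    }

    // Hypothetical transaction-log flush that always fails, simulating a full disk.
    private void commit() throws IOException {
        throw new IOException("No space left on device");
    }

    // Like SyncRequestProcessor.shutdown(): wait for the worker thread to finish.
    void shutdown() throws InterruptedException {
        join();
    }

    // Simulates System.exit running a shutdown hook that calls shutdown() on the
    // worker. Returns true if the hook finished within the timeout.
    static boolean hookCompletes(long timeoutMs) {
        CountDownLatch hookDone = new CountDownLatch(1);
        HandOffExitWorker[] worker = new HandOffExitWorker[1];
        worker[0] = new HandOffExitWorker(status -> {
            // Stub for System.exit(status): start a "hook" and wait for it,
            // as the JVM waits for its shutdown hooks.
            Thread hook = new Thread(() -> {
                try {
                    worker[0].shutdown();
                    hookDone.countDown();
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                }
            }, "hook");
            hook.start();
            try {
                hook.join();
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
            }
        });
        worker[0].start();
        try {
            return hookDone.await(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("hook completed: " + hookCompletes(1000));
    }
}
```

hookCompletes(1000) should return true: the worker returns right after starting the helper thread, so the hook's join() on it completes instead of waiting forever.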

I've also gone through Ted's earlier response on the disk-full scenario:
http://www.google.co.in/url?sa=t&source=web&cd=3&ved=0CCAQFjAC&url=http%3A%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fzookeeper-user%2F201106.mbox%2F%253CBANLkTimzQjXZvDKnP6xQLF9jHfsaz6JstA%40mail.gmail.com%253E&ei=FBQETvPWIcLNrQfk75yaDA&usg=AFQjCNFTkguyxTligpz1TZBmkqe9Osz-uw

We feel that even when one cluster member's disk is full, the complete
service should not be interrupted.

So we have raised a new JIRA for this issue:
https://issues.apache.org/jira/browse/ZOOKEEPER-1109
-----Original Message-----
From: Laxman [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 22, 2011 1:54 PM
To: [EMAIL PROTECTED]
Subject: Zookeeper service is down when Leader disk is full

Hi Everyone,

We found an issue while testing the disk-full scenario. Please validate our
observations; we will log an issue if this is found to be valid.

Problem: Zookeeper does not shut down completely when the dataDir disk is
full, and the ZK cluster goes into an unserviceable state.
Version: Zookeeper 3.3.3

Scenario:
When the leader's disk is filled up, the leader Zookeeper tries to shut
down, but it waits indefinitely while shutting down the SyncRequestProcessor
thread.

Root Cause: this.join() is invoked on the same thread that triggered
System.exit(11).
When the disk became full, the SyncRequestProcessor thread got a 'No space
left on device' exception and invoked System.exit(11) (the logs below show
this). Before exiting, the JVM executes the ShutdownHook registered by
QuorumPeerMain, and that flow reaches SyncRequestProcessor.shutdown(). There,
this.join() waits for the SyncRequestProcessor thread to finish, but that
thread is itself blocked inside System.exit(11), waiting for the shutdown
hook to complete. Neither thread can proceed, so the shutdown never finishes.
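The circular wait in the two thread dumps below can be reproduced in isolation. This is a minimal sketch that assumes nothing about ZooKeeper internals: CircularJoinDemo, hookJoinWouldBlockForever and the thread names are hypothetical, System.exit is not called, and the hook's join() is given a timeout so the program terminates instead of hanging.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of the circular wait in the thread dumps.
// "SyncThread" plays SyncThread:2, which called System.exit(11) and now waits
// for the shutdown hook; "hook" plays Thread-2 (QuorumPeerMain$1), which
// join()s the sync thread from SyncRequestProcessor.shutdown().
final class CircularJoinDemo {

    // Returns true if the hook's bounded join timed out while the sync thread
    // was still alive, i.e. an unbounded join() would never have returned.
    static boolean hookJoinWouldBlockForever(long joinTimeoutMs) {
        AtomicBoolean syncStillAlive = new AtomicBoolean(false);
        Thread[] threads = new Thread[2]; // [0] = sync thread, [1] = hook thread

        threads[1] = new Thread(() -> {
            Thread syncThread = threads[0];
            try {
                // Like SyncRequestProcessor.shutdown(): wait for the sync thread.
                syncThread.join(joinTimeoutMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            syncStillAlive.set(syncThread.isAlive());
        }, "hook");

        threads[0] = new Thread(() -> {
            Thread hookThread = threads[1];
            // Like System.exit -> Shutdown.runHooks: start the hook, wait for it.
            hookThread.start();
            try {
                hookThread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "SyncThread");

        threads[0].start();
        try {
            threads[0].join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return syncStillAlive.get();
    }

    public static void main(String[] args) {
        System.out.println("hook join would block forever: "
                + hookJoinWouldBlockForever(200));
    }
}
```

Running main prints `hook join would block forever: true`: the sync thread can only leave its wait once the hook finishes, and the hook can only finish once the sync thread exits.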

Thread dumps:

The following thread dumps show that the QuorumPeerMain shutdown hook thread
("Thread-2") is waiting indefinitely inside SyncRequestProcessor.shutdown(),
while "SyncThread:2" is waiting inside System.exit() for that hook to finish.

"Thread-2" prio=10 tid=0x0810a400 nid=0x1695 in Object.wait() [0xac783000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
        at java.lang.Thread.join(Thread.java:1143)
        - locked <0xb804f5e8> (a org.apache.zookeeper.server.SyncRequestProcessor)
        at java.lang.Thread.join(Thread.java:1196)
        at org.apache.zookeeper.server.SyncRequestProcessor.shutdown(SyncRequestProcessor.java:171)
        at org.apache.zookeeper.server.quorum.ProposalRequestProcessor.shutdown(ProposalRequestProcessor.java:79)
        at org.apache.zookeeper.server.PrepRequestProcessor.shutdown(PrepRequestProcessor.java:513)
        at org.apache.zookeeper.server.ZooKeeperServer.shutdown(ZooKeeperServer.java:413)
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:411)
        at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)

"SyncThread:2" prio=10 tid=0xad7fd800 nid=0x4acb in Object.wait() [0xac9ba000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
        at java.lang.Thread.join(Thread.java:1143)
        - locked <0xb8030d00> (a org.apache.zookeeper.server.quorum.QuorumPeerMain$1)
        at java.lang.Thread.join(Thread.java:1196)
        at java.lang.ApplicationShutdownHooks.runHooks(ApplicationShutdownHooks.java:79)
        at java.lang.ApplicationShutdownHooks$1.run(ApplicationShutdownHooks.java:24)
        at java.lang.Shutdown.runHooks(Shutdown.java:79)
        at java.lang.Shutdown.sequence(Shutdown.java:123)
        at java.lang.Shutdown.exit(Shutdown.java:168)
        - locked <0xf01ff3b0> (a java.lang.Class for java.lang.Shutdown)
        at java.lang.Runtime.exit(Runtime.java:90)
        at java.lang.System.exit(System.java:904)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:149)

Logs:
2011-06-21 10:09:59,730 - FATAL [SyncThread:2:SyncRequestProcessor@148] - Severe unrecoverable error, exiting
java.io.IOException: No space left on device
        at java.io.FileOutputStream.writeBytes(Native Method)
        at java.io.FileOutputStream.write(FileOutputStream.java:260)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        at org.apache.zookeeper.server.persistence.FileTxnLog.commit(FileTxnLog.java:305)
        at org.apache.zookeeper.server.persistence.FileTxnSnapLog.commit(FileTxnSnapLog.java:324)
        at org.apache.zookeeper.server.ZKDatabase.commit(ZKDatabase.java:484)
        at org.apache.zookeeper.server.SyncRequestProcessor.flush(SyncRequestProcessor.java:158)
        at org.apache.zookeeper.server.SyncRequestProcessor.run(SyncRequestProcessor.java:98)
2011-06-21 10:09:59,732 - INFO  [Thread-2:QuorumPeer@691] - The Quorum server is going for shutdown
2011-06-21 10:09:59,732 - INFO  [Thread-2:Leader@393] - Shutdown called
java.lang.Exception: shutdown Leader! reason: quorum Peer shutdown
        at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:393)
        at org.apache.zookeeper.server.quorum.QuorumPeer.shutdown(QuorumPeer.java:694)
        at org.apache.zookeeper.server.quorum.QuorumPeerMain$1.run(QuorumPeerMain.java:126)
2011-06-21 10:09:59,733 - INFO  [Thread-6:Leader$LearnerCnxAcceptor@243] - exception while shutting down acceptor: java.net.SocketException: Socket closed
2011-06-21 10:09:59,758 - INFO  [ProcessThread:-1:PrepRequestProcessor@1