Zookeeper, mail # user - ZooKeeper Cluster Crash resulted in not loadable database


Gunnar Wagenknecht 2012-09-05, 07:43
Re: ZooKeeper Cluster Crash resulted in not loadable database
Camille Fournier 2012-09-05, 19:41
You can try running them through org.apache.zookeeper.server.LogFormatter
and see what comes out. That's where I would start.
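
If it helps, here is a rough sketch (not a ZooKeeper tool, just a helper I'm
assuming for illustration) for lining the files up before dumping them. It
relies only on the naming convention of the files under dataDir/version-2,
"log.<hex zxid>" and "snapshot.<hex zxid>", sorts them by zxid, and builds one
LogFormatter command line per transaction log. The function names and the
classpath argument are hypothetical; adjust the classpath to wherever your
ZooKeeper and log4j jars actually live.

```python
import os
import re

# ZooKeeper names its files "log.<hex zxid>" and "snapshot.<hex zxid>".
_NAME = re.compile(r"^(log|snapshot)\.([0-9a-fA-F]+)$")


def list_txn_files(version_dir):
    """Return (kind, zxid, path) tuples sorted by zxid, then by kind."""
    found = []
    for name in os.listdir(version_dir):
        match = _NAME.match(name)
        if match:
            kind = match.group(1)
            zxid = int(match.group(2), 16)  # the suffix is hexadecimal
            found.append((kind, zxid, os.path.join(version_dir, name)))
    return sorted(found, key=lambda item: (item[1], item[0]))


def logformatter_commands(version_dir, classpath):
    """Build one LogFormatter command line per transaction log.

    `classpath` is an assumption: it must contain the ZooKeeper jar
    and its logging dependency for your installation.
    """
    return [
        "java -cp {} org.apache.zookeeper.server.LogFormatter {}".format(
            classpath, path
        )
        for kind, _zxid, path in list_txn_files(version_dir)
        if kind == "log"
    ]
```

Calling logformatter_commands() with your version-2 directory and classpath
gives you the commands in zxid order, so you can see which transaction log
picks up after the snapshot each server was trying to restore from.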

C

On Wed, Sep 5, 2012 at 3:43 AM, Gunnar Wagenknecht
<[EMAIL PROTECTED]> wrote:

> Hi,
>
> I'm investigating a crash of a ZooKeeper 3.3.4 cluster. It seems that
> the cause of the crash was an issue in the networking layer. All the ZK
> servers suddenly lost their connections to clients as well as to each
> other. Only a few seconds later, all ZooKeeper servers had issues
> loading their database because of the following exception.
>
> ERROR [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@224]
> Failed to increment parent cversion for: /a/b/c
> org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /a/b/c
> at DataTree.incrementCversion(DataTree.java:1218)
> at FileTxnSnapLog.processTransaction(FileTxnSnapLog.java:222)
> at FileTxnSnapLog.restore(FileTxnSnapLog.java:150)
> at ZKDatabase.loadDataBase(ZKDatabase.java:222)
> at QuorumPeer.getLastLoggedZxid(QuorumPeer.java:493)
> at FastLeaderElection.getInitLastLoggedZxid(FastLeaderElection.java:632)
> at FastLeaderElection.lookForLeader(FastLeaderElection.java:660)
> at QuorumPeer.run(QuorumPeer.java:622)
>
> WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@497]
> Unable to load database
>
> Note that the path "/a/b/c" was different on all servers. Thus, each
> server tried to restore a different transaction.
>
> The only way I was able to bring the cluster back online was to delete
> all the transaction logs on all servers and start with the latest snapshot.
>
> I have all the logs and snapshots available for investigation. Are there
> any tools to help an investigation? I'd like to find out how such a
> network outage could possibly cause such an inconsistent/unstable state
> in the system. I noticed a few stability fixes in 3.3.5/3.3.6. Thus, an
> upgrade is already scheduled.
>
> Any help is appreciated.
>
> -Gunnar
>
>
>
> --
> Gunnar Wagenknecht
> [EMAIL PROTECTED]
> http://wagenknecht.org/
>
>