I'm investigating a crash of a ZooKeeper 3.3.4 cluster. The cause seems
to have been an issue in the networking layer: all the ZK servers
suddenly lost their connections to clients as well as to each other.
Only a few seconds later, all ZooKeeper servers failed to load their
database because of the following exception:
Failed to increment parent cversion for: /a/b/c
org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode NoNode for /a/b/c
Unable to load database
Note that the path "/a/b/c" is a placeholder; the actual path was
different on each server. Thus, each server tried to restore a different
transaction.
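A minimal sketch of how the failing path could be pulled out of each
server's log for a side-by-side comparison. It assumes only the message
format quoted above; the server names and sample lines are placeholders,
not data from the actual cluster.

```python
import re

# Pattern based on the log message quoted above; the text after
# "for: " is taken to be the znode path (assumed to contain no spaces).
CVERSION_RE = re.compile(r"Failed to increment parent cversion for: (\S+)")


def failing_paths(log_lines):
    """Return the znode paths named in 'Failed to increment parent cversion' lines."""
    paths = []
    for line in log_lines:
        match = CVERSION_RE.search(line)
        if match:
            paths.append(match.group(1))
    return paths


def paths_by_server(logs):
    """Map each server name to the failing paths found in its log lines."""
    return {server: failing_paths(lines) for server, lines in logs.items()}


if __name__ == "__main__":
    # Hypothetical sample input; real use would read each server's log file.
    sample = {
        "server1": ["Failed to increment parent cversion for: /a/b/c"],
        "server2": ["Failed to increment parent cversion for: /x/y/z"],
    }
    for server, paths in sorted(paths_by_server(sample).items()):
        print(server, paths)
```

Lines that do not match the pattern are simply skipped, so whole log
files can be fed in unchanged.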
The only way I was able to bring the cluster back online was to delete
all the transaction logs on all servers and start with the latest snapshot.
I have all the logs and snapshots available for investigation. Are there
any tools to help with such an investigation? I'd like to find out how a
network outage could possibly cause such an inconsistent/unstable state
in the system. I noticed a few stability fixes in 3.3.5/3.3.6, so an
upgrade is already scheduled.
Any help is appreciated.