Zookeeper, mail # user - Failure to rejoin ensemble after reboot

Marshall McMullen 2012-07-09, 04:44
I'm trying to get to the bottom of a problem we're seeing where after I
forcibly reboot an ensemble node (running on Linux) via "reboot -f" it is
unable to rejoin the ensemble and no clients can connect to it. Has anyone
ever seen a problem like this before?

I have been investigating this under
https://issues.apache.org/jira/browse/ZOOKEEPER-1453 as on the surface it
looked like there was some sort of transaction/log corruption going on. But
now I'm not so sure of that.

What bothers me the most right now is that I am unable to reliably get the
node in question to rejoin the ensemble. I've removed the contents of the
"version-2" directory and restarted zookeeper to no avail. It regenerates
an epoch file but never obtains the new database from a peer. I even went
so far as to copy the on-disk database from another node and restart
zookeeper, and I still can't get it to rejoin the ensemble. I've also
seen anomalous behavior where, once the node is in this failed state, if I
stop all three zookeeper server processes entirely and then start them all
back up, everything connects and all three nodes are in the ensemble. But
this really shouldn't be necessary.
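
For anyone trying to reproduce this, one way to see what state each node
thinks it is in is to send the "srvr" four-letter command to every server's
client port. The following is only a minimal sketch: it assumes the client
port is 2181 and that four-letter commands are enabled on the servers, and
the host names in the usage note are hypothetical placeholders.

```python
import socket


def four_letter_word(host, port=2181, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        # The server writes its reply and then closes the connection.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(response):
    """Return the value of the 'Mode:' line, or None if there is none.

    A server that is running but has not joined a quorum does not report
    a Mode line, so None is a hint that the node is not serving requests.
    """
    for line in response.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_modes(hosts, port=2181):
    """Map each host to its reported mode, or to an 'unreachable' note."""
    modes = {}
    for host in hosts:
        try:
            modes[host] = parse_mode(four_letter_word(host, port))
        except OSError as exc:  # refused connection, timeout, DNS failure
            modes[host] = "unreachable: %s" % exc
    return modes
```

Calling ensemble_modes(["zk1", "zk2", "zk3"]) (hypothetical host names)
should report one leader and two followers on a healthy three-node
ensemble. A value of None for the rebooted node would mean the process is
up but has not rejoined the quorum.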

None of this matches the behavior I expected. If anyone has any insight, it
would be greatly appreciated.
Camille Fournier 2012-07-09, 14:09
Marshall McMullen 2012-07-09, 14:14
Camille Fournier 2012-07-09, 14:16
Patrick Hunt 2012-07-09, 16:48
Marshall McMullen 2012-07-09, 14:19