That is very strange. What do the logs of the misbehaving server say? What
do the logs of the other servers say? What does a stack dump of the
misbehaving server look like?
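
If it helps while gathering that information, here is a rough sketch (not an official tool) for asking each server what it thinks its role is. It sends ZooKeeper's "srvr" four-letter command to the client port and pulls out the "Mode:" line. The helper names are made up for illustration, and it assumes the four-letter commands are reachable on your client port. A server that is up but has not joined a quorum typically answers without a "Mode:" line, which this reports as None.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line (e.g. 'follower'), or None."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def ensemble_modes(servers, cmd="srvr"):
    """Map each (host, port) to its reported mode, or to an error string."""
    result = {}
    for host, port in servers:
        try:
            result[(host, port)] = parse_mode(four_letter_word(host, port, cmd))
        except OSError as exc:  # refused, timed out, unresolvable host, ...
            result[(host, port)] = "unreachable: %s" % exc
    return result
```

Calling ensemble_modes() with a list such as [("zk1.example.com", 2181), ("zk2.example.com", 2181), ("zk3.example.com", 2181)] (placeholder hostnames, substitute your own) before and after the "reboot -f" would show whether the rebooted node ever reports leader or follower. For the stack dump, running the JDK's jstack against the server's pid is the usual route.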
Also, just to clarify, if you don't do anything but fully stop and restart
the cluster (no deleting version-2 files, etc.), the whole ensemble will
On Mon, Jul 9, 2012 at 12:44 AM, Marshall McMullen <
[EMAIL PROTECTED]> wrote:
> I'm trying to get to the bottom of a problem we're seeing where after I
> forcibly reboot an ensemble node (running on Linux) via "reboot -f", it is
> unable to rejoin the ensemble and no clients can connect to it. Has anyone
> ever seen a problem like this before?
> I have been investigating this under
> https://issues.apache.org/jira/browse/ZOOKEEPER-1453 as on the surface it
> looked like there was some sort of transaction/log corruption going on. But
> now I'm not so sure of that.
> What bothers me the most right now is that I am unable to reliably get the
> node in question to rejoin the ensemble. I've removed the contents of the
> "version-2" directory and restarted zookeeper to no avail. It regenerates
> an epoch file but never obtains the new database from a peer. I even went
> so far as to copy the on-disk database from another node and restart
> zookeeper, and I still can't get it to rejoin the ensemble. I've also
> seen anomalous behavior where, once it is in this failed state, if I stop
> all three zookeeper server processes entirely and then start them all
> back up, everything connects and all three nodes are in the
> ensemble. But this really shouldn't be necessary.
> None of this matches the behavior I expected. If anyone has any insight, it
> would be greatly appreciated.