Thanks for the report, Scott. From what I've seen so far, this seems to
be a Linux bug and not specific to Java/ZK. Here are a couple of the
more informative links I've seen:
Does anyone have specific insight into how this expressed itself in Java?
I've seen some references to futex being the root cause (from the Java
perspective): "It's a critical Linux bug that causes futex to timeout,
and anything that uses it to behave incorrectly."
On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <[EMAIL PROTECTED]> wrote:
> Hello all,
> It appears that ZooKeeper is subject to the Linux leap second bug that has caused problems with Cassandra and other services. At least, I discovered that after 6 hours of trying to figure out why my cluster wasn't giving me a quorum.
> A link to the kernel bug report is at https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
> As for what you might see in your logs: I saw a lost quorum and insanely high load on my servers. When I shut down ZooKeeper to bring it back up, one machine would report a read timeout during leader election, then report that the server told it to shut down. After that, it would forever be stuck in the LOOKING phase, while another machine might be stuck in any other phase of the election.
> The fix is simple, though. Just stop ZooKeeper, execute
> date -s "`date`"
> or restart your NTP daemon, then start ZooKeeper back up.
> You MUST restart ZooKeeper; otherwise the election state doesn't recover (or, at least, it didn't recover for me).
> Hope this helps save someone else the 7 hours of agony I just went through.
> Scott Fines
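
For anyone who needs to apply Scott's workaround on several hosts, here is a minimal sketch of the three steps in the order he gives them. It is only an illustration, not a tested procedure: the ZK_SERVER path, the DRY_RUN switch, and the function names are my own assumptions. It defaults to printing the commands rather than running them, since setting the clock needs root.

```shell
#!/bin/sh
# Sketch of the stop / reset clock / start sequence described above.
# ZK_SERVER is an assumed location; point it at your own zkServer.sh.
ZK_SERVER="${ZK_SERVER:-zkServer.sh}"
# DRY_RUN=1 (the default here) only prints what would be executed.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or just print it when DRY_RUN is enabled.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

leap_second_workaround() {
    run "$ZK_SERVER" stop      # 1. stop ZooKeeper first
    run date -s "$(date)"      # 2. set the clock to its current value (needs root)
    run "$ZK_SERVER" start     # 3. start ZooKeeper again
}

leap_second_workaround
```

Run it once as is to check the printed commands, then with DRY_RUN=0 as root on each affected host. Restarting the NTP daemon instead of step 2 is the alternative Scott mentions; the command for that varies by distribution, so it is left out here.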