Zookeeper >> mail # user >> leap second excitement
Re: leap second excitement
Patrick, agreed. I've seen additional threads referencing this thread and
thought I would follow up with what I've learned since.

Due to a missed function call in the Linux timekeeping code, the leap
second was not accounted for properly. As a result, after the leap second,
timers expired one second earlier than requested. Many applications use a
recurring timer of 1 second or less; such timers expired immediately,
causing the application to immediately try to set another timer, ad
infinitum. This infinite loop led to CPU load spikes.

In case of interest, we wrote a blog post detailing it:
http://www.cloudera.com/blog/2012/07/watching-the-clock-clouderas-response-to-leap-second-troubles/

Regards, Kathleen

On Mon, Jul 2, 2012 at 9:36 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote:

> Thanks for the report, Scott. From what I've seen so far this seems to
> be a Linux bug and not specific to Java/ZK. Here are a couple of the
> more informative links I've seen:
> http://hackerne.ws/item?id=4188412
>
> http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix
>
> Does anyone have specific insight into how this expressed itself in Java?
> I've seen some references to futex being the root (from the Java
> perspective): "It's a critical Linux bug that causes futex to timeout,
> and anything that uses it to behave incorrectly."
>
> Patrick
>
> On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > It appears that ZooKeeper is subject to the Linux leap second bug that
> > has caused problems with Cassandra and other services. At least, I
> > discovered that after 6 hours of trying to figure out why my cluster
> > wasn't giving me a quorum.
> >
> > A link to the kernel bug report is at
> > https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
> >
> > As far as what you might see in your logs, I saw a lost quorum, insanely
> > high load on my servers, and when I shut down zookeeper to bring it back
> > up, one machine would report a read timeout during leader election, then
> > report that the server told it to shut down. After that, it would forever
> > be stuck in the LOOKING phase, while another machine might be stuck in any
> > other phase of the election.
> >
> > The fix is simple, though. Just stop ZooKeeper, execute
> >
> > date -s "`date`"
> >
> > or restart your ntp daemon, then start zookeeper back up.
> >
> > You MUST restart zookeeper; otherwise, the election state doesn't
> > recover (or, at least, it didn't recover for me).
> >
> > Hope this helps save someone else the 7 hours of agony I just went
> > through.
> >
> > Scott Fines
>
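
The recovery sequence Scott describes above (stop ZooKeeper, reset the clock, start ZooKeeper) can be sketched as a small script. This is only an illustration under assumptions: the ZK_SERVER path is hypothetical and must be adjusted to the local install, the helper names are made up for this sketch, and setting the clock requires root. It prints the steps by default and only executes them when APPLY=1 is set. Restarting the ntp daemon, the alternative mentioned in the thread, is not covered here.

```shell
#!/bin/sh
# Sketch of the recovery sequence from the thread; not an official procedure.
# ZK_SERVER is an assumed install path (hypothetical), override as needed.
ZK_SERVER="${ZK_SERVER:-/opt/zookeeper/bin/zkServer.sh}"

# Emit the three steps in order, one per line. Nothing is executed here.
leap_fix_steps() {
    printf '%s\n' \
        "$ZK_SERVER stop" \
        'date -s "$(date)"' \
        "$ZK_SERVER start"
}

# Dry run by default: only print each step.
# With APPLY=1 (run as root) each step is executed, stopping at the first failure.
run_leap_fix() {
    leap_fix_steps | while IFS= read -r step; do
        if [ "${APPLY:-0}" = "1" ]; then
            sh -c "$step" || exit 1
        else
            echo "would run: $step"
        fi
    done
}

run_leap_fix
```

Running the script as-is lists the commands without touching the system; running it as root with APPLY=1 in the environment performs them. The clock reset sits between the stop and the start because, per Scott's report, a ZooKeeper process left running across the reset did not recover its election state.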