Zookeeper >> mail # user >> leap second excitement


Scott Fines 2012-07-01, 21:58
Patrick Hunt 2012-07-02, 16:36
Re: leap second excitement
Patrick, agreed. I've seen other threads referencing this one and
thought I would follow up with what I've learned since.

Due to a missed function call in the Linux timekeeping code, the leap
second was not accounted for properly. As a result, after the leap second,
timers expired one second earlier than requested. Many applications use a
recurring timer of 1 second or less; such timers expired immediately,
causing the application to immediately try to set another timer, ad
infinitum. This infinite loop led to CPU load spikes.
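
To make the mechanism concrete, here is a toy model in Python. It is only
a sketch, not kernel code: the function name count_wakeups and its
parameters are made up for illustration, and the only assumption it encodes
is the one described above, that every timer fires one second earlier than
requested (but never before the current time).

```python
def count_wakeups(interval_ms, skew_ms, window_ms, max_wakeups=100_000):
    """Simulate a recurring timer on a model clock (all times in ms).

    interval_ms: sleep the application requests between wakeups
    skew_ms:     how much earlier than requested each timer fires
    window_ms:   how much simulated time to run for
    max_wakeups: safety cap so the buggy case terminates
    Returns (number of wakeups, simulated time elapsed).
    """
    now = 0
    wakeups = 0
    while now < window_ms and wakeups < max_wakeups:
        deadline = now + interval_ms
        # The timer fires skew_ms early, but never before "now".
        now = max(now, deadline - skew_ms)
        wakeups += 1
    return wakeups, now


print(count_wakeups(500, 0, 10_000))      # healthy clock: (20, 10000)
print(count_wakeups(500, 1_000, 10_000))  # 1 s early: (100000, 0)
```

In the second call the model clock never advances: each 500 ms timer
expires at once, the loop re-arms it, and only the safety cap stops it.
That is the busy loop behind the load spikes. In this model a timer longer
than one second (say 2000 ms) merely fires early and does not spin.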

In case it is of interest, we wrote a blog post with the details:
http://www.cloudera.com/blog/2012/07/watching-the-clock-clouderas-response-to-leap-second-troubles/

Regards, Kathleen

On Mon, Jul 2, 2012 at 9:36 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote:

> Thanks for the report, Scott. From what I've seen so far this seems to
> be a Linux bug and not specific to Java/ZK. Here are a couple of the
> more informative links I've seen:
> http://hackerne.ws/item?id=4188412
>
> http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix
>
> Does anyone have specific insight into how this expressed itself in Java?
> I've seen some references to futex being the root cause (from the Java
> perspective): "It's a critical Linux bug that causes futex to timeout,
> and anything that uses it to behave incorrectly."
>
> Patrick
>
> On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <[EMAIL PROTECTED]> wrote:
> > Hello all,
> >
> > It appears that ZooKeeper is subject to the Linux leap second bug that
> > has caused problems with Cassandra and other services. At least, I
> > discovered that after 6 hours of trying to figure out why my cluster
> > wasn't giving me a quorum.
> >
> > A link to the kernel bug report is at
> > https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
> >
> > As for what you might see in your logs: I saw a lost quorum and
> > insanely high load on my servers. When I shut down ZooKeeper to bring
> > it back up, one machine would report a read timeout during leader
> > election, then report that the server told it to shut down. After that,
> > it would be stuck in the LOOKING phase forever, while another machine
> > might be stuck in any other phase of the election.
> >
> > The fix is simple, though. Just stop ZooKeeper, execute
> >
> > date -s "`date`"
> >
> > or restart your NTP daemon, then start ZooKeeper back up.
> >
> > You MUST restart ZooKeeper; otherwise the election state doesn't
> > recover (or, at least, it didn't recover for me).
> >
> > Hope this helps save someone else the 7 hours of agony I just went
> > through.
> >
> > Scott Fines
>
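
The order of Scott's recovery steps matters (stop, reset the clock, start),
so here is a small Python sketch that only prints them in sequence and
executes nothing. The default script name zkServer.sh is an assumption
about how ZooKeeper is controlled on your hosts, and recovery_commands is a
made-up helper name; the date command is the one quoted above and needs
root when run for real.

```python
import shlex


def recovery_commands(zk_script="zkServer.sh"):
    """Return the recovery steps, in order, as shell command strings.

    zk_script is an assumed path to the ZooKeeper control script; adjust
    it for your installation. Nothing is executed here.
    """
    zk = shlex.quote(zk_script)
    return [
        f"{zk} stop",          # 1. stop ZooKeeper first
        'date -s "`date`"',    # 2. reset the system clock (needs root)
        f"{zk} start",         # 3. start ZooKeeper again
    ]


for cmd in recovery_commands():
    print(cmd)
```

Restarting the NTP daemon instead of step 2 is the alternative Scott
mentions; it is left out here because the service name varies by
distribution.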