Connection closed exceptions with slow fsync and CancelledKeyExceptions
We have been trying to understand why our ZooKeeper cluster will occasionally
have a wave of connection closed exceptions. We have switched to
-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode for garbage collection with
no noticeable improvements.

The symptoms are:

(1) All nodes show messages like "fsync-ing the write ahead log in
SyncThread:0 took 6309ms which will adversely effect operation latency. See
the ZooKeeper troubleshooting guide" with times typically around 5 seconds.
At least once, this fsync warning appeared in the leader's log immediately
before a wave of:

ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Our clients received ZookeeperConnectionClosed exceptions at this time and
all traffic on the ZooKeeper cluster essentially went to zero for a moment
before resuming normal operation with new connections.
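
To line up the slow fsync warnings against the exception waves, one option
is to pull the durations out of each server log. The following is only a
rough sketch: the regular expression is built from the warning text quoted
above, while the function name slow_fsyncs and the 1000 ms default
threshold are assumptions made for illustration, not anything ZooKeeper
defines.

```python
import re

# Pattern derived from the warning text quoted in this message.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(lines, threshold_ms=1000):
    """Yield (thread, millis) for each fsync warning at or above threshold_ms.

    threshold_ms is an arbitrary illustrative cutoff.
    """
    for line in lines:
        match = FSYNC_RE.search(line)
        if match and int(match.group(2)) >= threshold_ms:
            yield match.group(1), int(match.group(2))


# Sample lines copied from the log excerpts in this message.
sample = [
    "fsync-ing the write ahead log in SyncThread:0 took 6309ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
    "ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:",
]

print(list(slow_fsyncs(sample)))
```

Running the same function over each node's log (for example with
slow_fsyncs(open(path)), where path is wherever that node writes its log)
gives a per-node list that can be compared with the times the clients saw
their connections closed.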

(2) Probably unrelated since I haven't correlated it temporally with the
client errors, but running "sudo strace -r -T -f -p 9574 -e
trace=fsync,fdatasync -o trace.txt" turns up some messages like "10581
0.000246 --- SIGSEGV (Segmentation fault) @ 0 (0) ---"

ZK Version: 3.3.4
Cluster has 5 nodes running in EC2

Here is a screenshot showing ZooKeeper network traffic going to zero at the
time of the connection closed exceptions: http://i.imgur.com/dfNh0.png

Does anyone have ideas on what could be causing these "waves" of
CancelledKeyExceptions?

--
View this message in context: http://zookeeper-user.578899.n2.nabble.com/Connection-closed-exceptions-with-slow-fsync-and-CancelledKeyExceptions-tp7578166.html
Sent from the zookeeper-user mailing list archive at Nabble.com.