Connection closed exceptions with slow fsync and CancelledKeyExceptions
mattgordon 2012-11-08, 01:48
We have been trying to understand why our ZooKeeper cluster occasionally has a wave of connection closed exceptions. We have switched to -XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode for garbage collection, with no noticeable improvement.

The symptoms are:

(1) All nodes show messages like

"fsync-ing the write ahead log in SyncThread:0 took 6309ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide"

with times typically around 5 seconds. At least once, this fsync warning appeared in the leader's log immediately before a wave of:

ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:
java.nio.channels.CancelledKeyException
    at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
    at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
    at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:418)
    at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1509)
    at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:171)
    at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:73)

Our clients received ZookeeperConnectionClosed exceptions at this time, and all traffic on the ZooKeeper cluster essentially went to zero for a moment before resuming normal operation with new connections.

(2) Probably unrelated, since I haven't correlated it temporally with the client errors, but running

sudo strace -r -T -f -p 9574 -e trace=fsync,fdatasync -o trace.txt

turns up some messages like

10581 0.000246 --- SIGSEGV (Segmentation fault) @ 0 (0) ---

ZK version: 3.3.4
Cluster has 5 nodes running in EC2.

Here is a screenshot showing ZooKeeper network traffic going to zero at the time of the connection closed exceptions: http://i.imgur.com/dfNh0.png

Does anyone have ideas on what could be causing these "waves" of CancelledKeyExceptions?
--
View this message in context: http://zookeeper-user.578899.n2.nabble.com/Connection-closed-exceptions-with-slow-fsync-and-CancelledKeyExceptions-tp7578166.html
Sent from the zookeeper-user mailing list archive at Nabble.com.
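For anyone trying to line up the slow fsync warnings with the client disconnects, a rough sketch like the one below might help pull the fsync durations out of a server log. It only assumes the warning text quoted in the post; the function name slow_fsyncs, the threshold_ms parameter, and the sample lines are made up for illustration, and nothing is assumed about the rest of the log layout (timestamps and so on).

```python
import re

# Matches the warning text quoted in the post. Only the message itself is
# assumed; any timestamp or log-level prefix on the line is ignored.
FSYNC_RE = re.compile(r"fsync-ing the write ahead log in (\S+) took (\d+)ms")


def slow_fsyncs(lines, threshold_ms=1000):
    """Return (thread, millis) for each fsync warning at or above threshold_ms.

    slow_fsyncs and threshold_ms are hypothetical names for this sketch.
    """
    hits = []
    for line in lines:
        m = FSYNC_RE.search(line)
        if m and int(m.group(2)) >= threshold_ms:
            hits.append((m.group(1), int(m.group(2))))
    return hits


# Illustrative input built from the two messages quoted in the post.
sample = [
    "fsync-ing the write ahead log in SyncThread:0 took 6309ms which will "
    "adversely effect operation latency. See the ZooKeeper troubleshooting guide",
    "ERROR [CommitProcessor:0:NIOServerCnxn@445] - Unexpected Exception:",
]

print(slow_fsyncs(sample))
```

Running the same function over each node's log and comparing the results against the times of the client errors would be one way to check whether the two actually coincide.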
Vitalii Tymchyshyn 2012-11-12, 00:12