Corrupt files across all nodes in a 3 node cluster on shutdown
Todd Nine 2011-04-11, 08:16
Hi all,
I'm running version 3.3.2 with 3 nodes. The workload on ZK itself is very small; it's primarily used for leader/follower election and low-volume inter-node communication (1 or 2 messages per second). We're hosted on Amazon EC2 using RAID on EBS. I've noticed on several occasions that when we shut down all 3 nodes at once, we eventually reach a state where ZooKeeper cannot be restarted. I attach the EBS drives, re-attach the RAID volumes, and mount the file systems successfully. However, all 3 nodes fail to start. I consistently receive errors such as the one included below. The only resolution seems to be to completely remove all the data from the /mnt/zookeeperdata/version-2 directory.

Any ideas why this is happening? We're migrating our Quartz implementation over to ZK for job and trigger coordination in the cluster. Once that happens we can't have data loss. Any help would be greatly appreciated.

Thanks,
Todd

Stack trace

011-04-11 08:00:25,572 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /mnt/zookeeperdata/version-2 snapdir /mnt/zookeeperdata/version-2
2011-04-11 08:00:25,576 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /mnt/zookeeperdata/version-2/snapshot.200000000
2011-04-11 08:00:25,580 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@208] - Snapshotting: 200000047
2011-04-11 08:00:25,602 - INFO [LearnerHandler-/10.161.98.28:59536 :LearnerHandler@247] - Follower sid: 0 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@21b64e6a
2011-04-11 08:00:25,603 - WARN [LearnerHandler-/10.161.98.28:59536 :LearnerHandler@326] - Sending snapshot last zxid of peer is 0x200000047 zxid of leader is 0x300000000
2011-04-11 08:00:25,606 - WARN [LearnerHandler-/10.161.98.28:59536 :Leader@471] - Commiting zxid 0x300000000 from /10.160.137.155:2888 not first!
2011-04-11 08:00:25,606 - WARN [LearnerHandler-/10.161.98.28:59536 :Leader@473] - First is 0
2011-04-11 08:00:25,606 - INFO [LearnerHandler-/10.161.98.28:59536 :Leader@497] - Have quorum of supporters; starting up and setting last processed zxid: 12884901888
2011-04-11 08:01:19,835 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 352192519805 (n.zxid), 1 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
2011-04-11 08:02:33,688 - INFO [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
2011-04-11 08:02:33,700 - INFO [LearnerHandler-/10.160.246.112:40243 :LearnerHandler@247] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@bd10a5c
2011-04-11 08:02:33,701 - WARN [LearnerHandler-/10.160.246.112:40243 :LearnerHandler@326] - Sending snapshot last zxid of peer is 0x0 zxid of leader is 0x300000000
2011-04-11 08:02:34,283 - ERROR [LearnerHandler-/10.160.246.112:40243 :LearnerHandler@466] - Unexpected exception causing shutdown while sock still open
java.io.EOFException
    at java.io.DataInputStream.readInt(DataInputStream.java:375)
    at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
    at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
    at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
    at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:380)
2011-04-11 08:02:34,284 - WARN [Thread-7:QuorumCnxManager$RecvWorker@702] - Connection broken for id 2, my id = 1, error = java.io.IOException: Channel eof
2011-04-11 08:02:34,285 - WARN [LearnerHandler-/10.160.246.112:40243 :LearnerHandler@479] - ******* GOODBYE /10.160.246.112:40243 ********
2011-04-11 08:02:59,364 - WARN [Thread-6:QuorumCnxManager$SendWorker@612] - Interrupted while waiting for message on queue
java.lang.InterruptedException
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
    at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:601)

zoo.cfg

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/mnt/zookeeperdata
# the port at which the clients will connect
clientPort=2181
#the max number of client connections
maxClientCnxns=50
#peer_port2888leader_port3888ipaddress10.161.98.28
server.0=10.161.98.28:2888:3888
#peer_port2888leader_port3888ipaddress10.160.137.155
server.1=10.160.137.155:2888:3888
#peer_port2888leader_port3888ipaddress10.160.246.112
server.2=10.160.246.112:2888:3888
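
For anyone reading the log above: the zxids appear in both hex (0x300000000) and decimal (12884901888) form, which makes them hard to compare by eye. A zxid is a 64-bit value whose high 32 bits are the leader epoch and whose low 32 bits are a per-epoch counter. The snippet below is only a small sketch for decoding the values quoted in the log; the helper name split_zxid is made up for illustration and is not part of ZooKeeper.

```python
def split_zxid(zxid: int) -> tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    Hypothetical helper for reading logs; not a ZooKeeper API.
    High 32 bits = leader epoch, low 32 bits = counter within that epoch.
    """
    return zxid >> 32, zxid & 0xFFFFFFFF


# zxid values copied from the log in the post above
samples = {
    "peer sid 0 last zxid": 0x200000047,
    "leader zxid": 0x300000000,
    "leader last processed zxid": 12884901888,
    "notification n.zxid from sid 2": 352192519805,
}

for label, zxid in samples.items():
    epoch, counter = split_zxid(zxid)
    print(f"{label}: zxid={zxid:#x} epoch={epoch} counter={counter}")
```

Run as-is, this shows that 12884901888 and 0x300000000 are the same zxid (epoch 3, counter 0), and lets the election notification values be compared against the leader's on the same footing.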
Vishal Kher 2011-04-11, 14:53