Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Zookeeper >> mail # user >> Zookeepers going out of sync


Copy link to this message
-
Zookeepers going out of sync
Hi,

I have 5 box zookeeper quorum. 2 / 5 are unable to deserialize the message from leader and hence they are not coming up. I tried restarting few times but still seeing the same issue. This issue is observed when we were trying to do the following:
1>   We have a script which brings down the zookeeper server on the leader. We are trying to test if new follower takes over once leader is down.

2>   The script has been running since 2-3 days. Surprisingly we don't see any issue for the first 2 days but after that we hit the below issue wherein 2 out 5 zookeeper machines are unable to deserialize the message.

zookeeper server = 3.4.5

Following are the zookeeper settings:

tickTime=4000
initLimit=20
syncLimit=15
snapCount=100000

Logs from Follower

---------------

Thu Mar 28 19:14:14 2013: 2013-03-28 19:14:14,158 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3
(n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - WARN
[QuorumPeer:/0.0.0.0:2181:Follower@82] - Exception when following the leader
Thu Mar 28 19:14:41 2013: java.io.EOFException
Thu Mar 28 19:14:41 2013:       at
java.io.DataInputStream.readInt(DataInputStream.java:392)
Thu Mar 28 19:14:41 2013:       at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
Thu Mar 28 19:14:41 2013:       at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:148)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:332)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:75)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:645)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,811 - INFO
[QuorumPeer:/0.0.0.0:2181:Follower@165] - shutdown called
Thu Mar 28 19:14:41 2013: java.lang.Exception: shutdown Follower
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:165)
Thu Mar 28 19:14:41 2013:       at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:649)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO
[QuorumPeer:/0.0.0.0:2181:QuorumPeer@621] - LOOKING
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,812 - INFO
[QuorumPeer:/0.0.0.0:2181:FastLeaderElection@663] - New election. My id =  1,
Proposed zxid = 0
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,813 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 1 (n.leader), 0 (n.zxid), 3
(n.round), LOOKING (n.state), 1 (n.sid), LOOKING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051
(n.zxid), 2 (n.round), FOLLOWING (n.state), 3 (n.sid), LOOKING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051
(n.zxid), 2 (n.round), FOLLOWING (n.state), 2 (n.sid), LOOKING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051
(n.zxid), 2 (n.round), LEADING (n.state), 4 (n.sid), LOOKING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 4 (n.leader), 85899346051
(n.zxid), 2 (n.round), FOLLOWING (n.state), 5 (n.sid), FOLLOWING (my state)
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,814 - INFO
[QuorumPeer:/0.0.0.0:2181:QuorumPeer@643] - FOLLOWING
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,815 - INFO
[QuorumPeer:/0.0.0.0:2181:ZooKeeperServer@151] - Created server with tickTime
2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir
/home/y/var/zookeeper/version-2 snapdir /home/y/var/zookeeper/version-2
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,817 - INFO
[QuorumPeer:/0.0.0.0:2181:Learner@294] - Getting a snapshot from leader
Thu Mar 28 19:14:41 2013: 2013-03-28 19:14:41,819 - INFO
[QuorumPeer:/0.0.0.0:2181:Learner@325] - Setting leader epoch 15
Thu Mar 28 19:14:54 2013: 2013-03-28 19:14:54,208 - INFO  [WorkerReceiver
Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 3
(n.round), LOOKING (n.state), 2 (n.sid), FOLLOWING (my state)
Logs from Leader
Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,416 - ERROR
[LearnerHandler-/98.139.207.73:37505:LearnerHandler@562] - Unexpected exception
causing shutdown while sock still open
Thu Mar 28 19:18:14 2013: java.net.SocketTimeoutException: Read timed out
Thu Mar 28 19:18:14 2013:       at
java.net.SocketInputStream.socketRead0(Native Method)
Thu Mar 28 19:18:14 2013:       at
java.net.SocketInputStream.read(SocketInputStream.java:150)
Thu Mar 28 19:18:14 2013:       at
java.net.SocketInputStream.read(SocketInputStream.java:121)
Thu Mar 28 19:18:14 2013:       at
java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
Thu Mar 28 19:18:14 2013:       at
java.io.BufferedInputStream.read(BufferedInputStream.java:254)
Thu Mar 28 19:18:14 2013:       at
java.io.DataInputStream.readInt(DataInputStream.java:387)
Thu Mar 28 19:18:14 2013:       at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
Thu Mar 28 19:18:14 2013:       at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
Thu Mar 28 19:18:14 2013:       at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
Thu Mar 28 19:18:14 2013:       at
org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
Thu Mar 28 19:18:14 2013: 2013-03-28 19:18:14,417 - WARN
[LearnerHandler-/98.139.207.73:37505