|
|
-
Re: new leader accepting create requests too early?Flavio Junqueira 2011-03-22, 06:55
Hi Jeremy, Thanks for reporting. Without logs from the other nodes, it
is somewaht difficult to determine more precisely what happened. But, from the logs, it sounds like node #1 (id 215) is the only leading from these two notifications: 2672 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 215 (n.leader), 12884902548 (n.zxid), 3 (n.round), LEADING (n.state), 215 (n.sid), LOOKING (my state) 2672 [WorkerReceiver Thread] INFO org.apache.zookeeper.server.quorum.FastLeaderElection - Notification: 215 (n.leader), 12884902548 (n.zxid), 3 (n.round), FOLLOWING (n.state), 126 (n.sid), LEADING (my state) From the information you have given, I can't explain right now why the sequence number is smaller in step 12. If you happen to have more information, it would be great if you could report in a jira. -Flavio PS: We should perhaps discuss this issue on dev, so I'm copying it to dev. On Mar 22, 2011, at 3:15 AM, Jeremy Stribling wrote: > Hi all, > > I ran into a weird case where Zookeeper seems to have elected a new > leader (a node that just restarted and missed several operations), > and a > client is able to connect to it and create a new sequential node that > has a number earlier than the last node it created. I don't have full > logs, or a live system in this state, or any data directories, just > some > partial server logs and the evidence as seen by the client. Haven't > tried reproducing it yet, just wanted to see if anyone here had any > ideas. Here's the scenario (probably more info than necessary, but > trying to be complete) > > Version: Zookeeper 3.3.3 > > 1) Initially (5:37:20): 3 nodes up, with ids 215, 126, and 37 (called > nodes #1, #2, and #3 below): > 2) Nodes periodically (and throughout this whole timeline) create > sequential, non-ephemeral nodes under the /zkrsm parent node. > 3) 5:46:57: Node #1 gets notified of > /zkrsm/0000000000000000_record0000002116 > 4) 5:47:06: Node #1 restarts and rejoins > 5) 5:49:26: Node #2 gets notified of > /zkrsm/0000000000000000_record0000002708 > 6) 5:49:29: Node #2 restarts and rejoins > 7) 5:52:01: Node #3 gets notified of > /zkrsm/0000000000000000_record0000003291 > 8) 5:52:02: Node #3 restarts and begins the rejoining process > 9) 5:52:08: Node #1 successfully creates > /zkrsm/0000000000000000_record0000003348 > 10) 5:52:08: Node #2 dies after getting notified of > /zkrsm/0000000000000000_record0000003348 > 11) 5:52:10ish: Node #3 is elected leader (the ZK server log doesn't > have wallclock timestamps, so not exactly sure on the ordering of > this step) > 12) 5:52:15: Node #1 successfully creates > /zkrsm/0000000000000000_record0000003292 > > Note that the node created in step #12 is lower than the one created > in > step #9, and is exactly one greater than the last node seen by node #3 > before it restarted. > > The leader election bit from node #3's log after restarting might be > of > interest: > >> >> 2644 [QuorumPeer:/0.0.0.0:2888] INFO >> org.apache.zookeeper.server.quorum.QuorumPeer - LOOKING >> 2644 [QuorumPeer:/0.0.0.0:2888] DEBUG >> org.apache.zookeeper.server.quorum.QuorumPeer - Initializing leader >> election protocol... >> 2647 [QuorumPeer:/0.0.0.0:2888] INFO >> org.apache.zookeeper.server.quorum.FastLeaderElection - New >> election. >> My id = 37, Proposed zxid = 17179869831 >> 2650 [WorkerSender Thread] DEBUG >> org.apache.zookeeper.server.quorum.QuorumCnxManager - Opening >> channel >> to server 126 >> 2653 [WorkerReceiver Thread] DEBUG >> org.apache.zookeeper.server.quorum.FastLeaderElection - Receive new >> notification message. My id = 37 >> 2653 [WorkerReceiver Thread] INFO >> org.apache.zookeeper.server.quorum.FastLeaderElection - >> Notification: >> 37 (n.leader), 17179869831 (n.zxid), 1 (n.round), LOOKING (n.state), >> 37 (n.sid), LOOKING (my state) >> 2654 [QuorumPeer:/0.0.0.0:2888] DEBUG >> org.apache.zookeeper.server.quorum.FastLeaderElection - id: 37, >> proposed id: 37, zxid: 17179869831, proposed zxid: 17179869831 flavio junqueira research scientist [EMAIL PROTECTED] direct +34 93-183-8828 avinguda diagonal 177, 8th floor, barcelona, 08018, es phone (408) 349 3300 fax (408) 349 3301 |