|
|
-
Leader election failure
Jared Cantwell 2012-07-27, 04:40
I have a 5 node cluster configured using dynamic zookeeper. It has been through several reconfigurations, but at the moment I am simply trying to start 3 of the nodes to get ZK accessible. I have confirmed that the myid files match the entries in the dynamic membership file for the 3 nodes in question. However, when I start up the three nodes I get the following error:
2012-07-26 22:26:01,037 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :Leader@445] - LEADING - LEADER ELECTION TOOK - 13 2012-07-26 22:26:01,039 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :FileSnap@83] - Reading snapshot /sf/data/zookeeper/ 10.10.5.27/version-2/snapshot.3000001e3 2012-07-26 22:26:01,065 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :FileTxnSnapLog@270] - Snapshotting: 0x3000001e3 to /sf/data/zookeeper/ 10.10.5.27/version-2/snapshot.3000001e3 2012-07-26 22:26:10,837 [myid:8] - INFO [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config version) 2012-07-26 22:26:20,849 [myid:8] - INFO [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config version) 2012-07-26 22:26:21,083 [myid:8] - WARN [QuorumPeer[myid=8]/10.10.5.27:2181 :QuorumPeer@949] - Unexpected exception java.lang.InterruptedException: *Timeout while waiting for epoch from quorum * at org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1207) at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:464) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:946) 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :Leader@614] - Shutting down 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :Leader@620] - Shutdown called java.lang.Exception: shutdown Leader! reason: Forcing shutdown at org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:620) at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:952) 2012-07-26 22:26:21,084 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 :ZooKeeperServer@413] - shutting down 2012-07-26 22:26:21,084 [myid:8] - INFO [LearnerCnxAcceptor-0.0.0.0/0.0.0.0:2182:Leader$LearnerCnxAcceptor@407] - exception while shutting down acceptor: java.net.SocketException: Socket closed
I am not sure what to make of it or how to debug from here. Any pointers or suggestions on how to debug what might be wrong, or simply some usual causes of this error would be appreciated.
Thanks! Jared
+
Jared Cantwell 2012-07-27, 04:40
-
Re: Leader election failure
Hanno Schlichting 2012-07-27, 05:11
What version of ZK are yoy using? There's a bug in 3.4.x with 5 node clusters failing to agree on a leader. That's only solved in the yet unreleased 3.4.4.
Hanno
On 27.07.2012, at 06:40, Jared Cantwell <[EMAIL PROTECTED]> wrote:
> I have a 5 node cluster configured using dynamic zookeeper. It has been > through several reconfigurations, but at the moment I am simply trying to > start 3 of the nodes to get ZK accessible. I have confirmed that the myid > files match the entries in the dynamic membership file for the 3 nodes in > question. However, when I start up the three nodes I get the following > error: > > 2012-07-26 22:26:01,037 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :Leader@445] - LEADING - LEADER ELECTION TOOK - 13 > 2012-07-26 22:26:01,039 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :FileSnap@83] - Reading snapshot /sf/data/zookeeper/ > 10.10.5.27/version-2/snapshot.3000001e3 > 2012-07-26 22:26:01,065 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :FileTxnSnapLog@270] - Snapshotting: 0x3000001e3 to /sf/data/zookeeper/ > 10.10.5.27/version-2/snapshot.3000001e3 > 2012-07-26 22:26:10,837 [myid:8] - INFO > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config version) > 2012-07-26 22:26:20,849 [myid:8] - INFO > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config version) > 2012-07-26 22:26:21,083 [myid:8] - WARN [QuorumPeer[myid=8]/10.10.5.27:2181 > :QuorumPeer@949] - Unexpected exception > java.lang.InterruptedException: *Timeout while waiting for epoch from quorum > * > at > org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1207) > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:464) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:946) > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :Leader@614] - Shutting down > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :Leader@620] - Shutdown called > java.lang.Exception: shutdown Leader! reason: Forcing shutdown > at > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:620) > at > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:952) > 2012-07-26 22:26:21,084 [myid:8] - INFO [QuorumPeer[myid=8]/10.10.5.27:2181 > :ZooKeeperServer@413] - shutting down > 2012-07-26 22:26:21,084 [myid:8] - INFO > [LearnerCnxAcceptor-0.0.0.0/0.0.0.0:2182:Leader$LearnerCnxAcceptor@407] - > exception while shutting down acceptor: java.net.SocketException: Socket > closed > > I am not sure what to make of it or how to debug from here. Any pointers > or suggestions on how to debug what might be wrong, or simply some usual > causes of this error would be appreciated. > > Thanks! > Jared
+
Hanno Schlichting 2012-07-27, 05:11
-
Re: Leader election failure
Jared Cantwell 2012-07-27, 05:20
We are currently testing out 3.5.0. If the fix made it into 3.4.4, I assume that issue is also fixed in 3.5.0?
~Jared
On Thu, Jul 26, 2012 at 11:11 PM, Hanno Schlichting <[EMAIL PROTECTED]>wrote:
> What version of ZK are yoy using? There's a bug in 3.4.x with 5 node > clusters failing to agree on a leader. That's only solved in the yet > unreleased 3.4.4. > > Hanno > > On 27.07.2012, at 06:40, Jared Cantwell <[EMAIL PROTECTED]> wrote: > > > I have a 5 node cluster configured using dynamic zookeeper. It has been > > through several reconfigurations, but at the moment I am simply trying to > > start 3 of the nodes to get ZK accessible. I have confirmed that the > myid > > files match the entries in the dynamic membership file for the 3 nodes in > > question. However, when I start up the three nodes I get the following > > error: > > > > 2012-07-26 22:26:01,037 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :Leader@445] - LEADING - LEADER ELECTION TOOK - 13 > > 2012-07-26 22:26:01,039 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :FileSnap@83] - Reading snapshot /sf/data/zookeeper/ > > 10.10.5.27/version-2/snapshot.3000001e3 > > 2012-07-26 22:26:01,065 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :FileTxnSnapLog@270] - Snapshotting: 0x3000001e3 to /sf/data/zookeeper/ > > 10.10.5.27/version-2/snapshot.3000001e3 > > 2012-07-26 22:26:10,837 [myid:8] - INFO > > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config > version) > > 2012-07-26 22:26:20,849 [myid:8] - INFO > > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config > version) > > 2012-07-26 22:26:21,083 [myid:8] - WARN [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :QuorumPeer@949] - Unexpected exception > > java.lang.InterruptedException: *Timeout while waiting for epoch from > quorum > > * > > at > > > org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1207) > > at org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:464) > > at > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:946) > > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :Leader@614] - Shutting down > > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :Leader@620] - Shutdown called > > java.lang.Exception: shutdown Leader! reason: Forcing shutdown > > at > > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:620) > > at > > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:952) > > 2012-07-26 22:26:21,084 [myid:8] - INFO [QuorumPeer[myid=8]/ > 10.10.5.27:2181 > > :ZooKeeperServer@413] - shutting down > > 2012-07-26 22:26:21,084 [myid:8] - INFO > > [LearnerCnxAcceptor-0.0.0.0/0.0.0.0:2182:Leader$LearnerCnxAcceptor@407] > - > > exception while shutting down acceptor: java.net.SocketException: Socket > > closed > > > > I am not sure what to make of it or how to debug from here. Any pointers > > or suggestions on how to debug what might be wrong, or simply some usual > > causes of this error would be appreciated. > > > > Thanks! > > Jared >
+
Jared Cantwell 2012-07-27, 05:20
-
Re: Leader election failure
Jared Cantwell 2012-07-27, 14:54
It looks like I had a firewall issue with the leader election port on one of the nodes. Sorry for the spam.
~Jared
On Thu, Jul 26, 2012 at 11:20 PM, Jared Cantwell <[EMAIL PROTECTED]>wrote:
> We are currently testing out 3.5.0. If the fix made it into 3.4.4, I > assume that issue is also fixed in 3.5.0? > > ~Jared > > > On Thu, Jul 26, 2012 at 11:11 PM, Hanno Schlichting <[EMAIL PROTECTED]>wrote: > >> What version of ZK are yoy using? There's a bug in 3.4.x with 5 node >> clusters failing to agree on a leader. That's only solved in the yet >> unreleased 3.4.4. >> >> Hanno >> >> On 27.07.2012, at 06:40, Jared Cantwell <[EMAIL PROTECTED]> wrote: >> >> > I have a 5 node cluster configured using dynamic zookeeper. It has been >> > through several reconfigurations, but at the moment I am simply trying >> to >> > start 3 of the nodes to get ZK accessible. I have confirmed that the >> myid >> > files match the entries in the dynamic membership file for the 3 nodes >> in >> > question. However, when I start up the three nodes I get the following >> > error: >> > >> > 2012-07-26 22:26:01,037 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :Leader@445] - LEADING - LEADER ELECTION TOOK - 13 >> > 2012-07-26 22:26:01,039 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :FileSnap@83] - Reading snapshot /sf/data/zookeeper/ >> > 10.10.5.27/version-2/snapshot.3000001e3 >> > 2012-07-26 22:26:01,065 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :FileTxnSnapLog@270] - Snapshotting: 0x3000001e3 to /sf/data/zookeeper/ >> > 10.10.5.27/version-2/snapshot.3000001e3 >> > 2012-07-26 22:26:10,837 [myid:8] - INFO >> > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 >> > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 >> > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config >> version) >> > 2012-07-26 22:26:20,849 [myid:8] - INFO >> > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 >> > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 >> > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config >> version) >> > 2012-07-26 22:26:21,083 [myid:8] - WARN [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :QuorumPeer@949] - Unexpected exception >> > java.lang.InterruptedException: *Timeout while waiting for epoch from >> quorum >> > * >> > at >> > >> org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1207) >> > at >> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:464) >> > at >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:946) >> > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :Leader@614] - Shutting down >> > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :Leader@620] - Shutdown called >> > java.lang.Exception: shutdown Leader! reason: Forcing shutdown >> > at >> > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:620) >> > at >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:952) >> > 2012-07-26 22:26:21,084 [myid:8] - INFO [QuorumPeer[myid=8]/ >> 10.10.5.27:2181 >> > :ZooKeeperServer@413] - shutting down >> > 2012-07-26 22:26:21,084 [myid:8] - INFO >> > [LearnerCnxAcceptor-0.0.0.0/0.0.0.0:2182:Leader$LearnerCnxAcceptor@407] >> - >> > exception while shutting down acceptor: java.net.SocketException: Socket >> > closed >> > >> > I am not sure what to make of it or how to debug from here. Any >> pointers >> > or suggestions on how to debug what might be wrong, or simply some usual >> > causes of this error would be appreciated. >> > >> > Thanks! >> > Jared >> > >
+
Jared Cantwell 2012-07-27, 14:54
-
Re: Leader election failure
David Nickerson 2012-07-27, 17:01
> > Sorry for the spam. Other people may run into the same problem with leader election, search for it, and come across your post. On behalf of them, thanks for posting your solution.
On Fri, Jul 27, 2012 at 10:54 AM, Jared Cantwell <[EMAIL PROTECTED]>wrote:
> It looks like I had a firewall issue with the leader election port on one > of the nodes. Sorry for the spam. > > ~Jared > > On Thu, Jul 26, 2012 at 11:20 PM, Jared Cantwell > <[EMAIL PROTECTED]>wrote: > > > We are currently testing out 3.5.0. If the fix made it into 3.4.4, I > > assume that issue is also fixed in 3.5.0? > > > > ~Jared > > > > > > On Thu, Jul 26, 2012 at 11:11 PM, Hanno Schlichting <[EMAIL PROTECTED] > >wrote: > > > >> What version of ZK are yoy using? There's a bug in 3.4.x with 5 node > >> clusters failing to agree on a leader. That's only solved in the yet > >> unreleased 3.4.4. > >> > >> Hanno > >> > >> On 27.07.2012, at 06:40, Jared Cantwell <[EMAIL PROTECTED]> > wrote: > >> > >> > I have a 5 node cluster configured using dynamic zookeeper. It has > been > >> > through several reconfigurations, but at the moment I am simply trying > >> to > >> > start 3 of the nodes to get ZK accessible. I have confirmed that the > >> myid > >> > files match the entries in the dynamic membership file for the 3 nodes > >> in > >> > question. However, when I start up the three nodes I get the > following > >> > error: > >> > > >> > 2012-07-26 22:26:01,037 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :Leader@445] - LEADING - LEADER ELECTION TOOK - 13 > >> > 2012-07-26 22:26:01,039 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :FileSnap@83] - Reading snapshot /sf/data/zookeeper/ > >> > 10.10.5.27/version-2/snapshot.3000001e3 > >> > 2012-07-26 22:26:01,065 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :FileTxnSnapLog@270] - Snapshotting: 0x3000001e3 to > /sf/data/zookeeper/ > >> > 10.10.5.27/version-2/snapshot.3000001e3 > >> > 2012-07-26 22:26:10,837 [myid:8] - INFO > >> > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > >> > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > >> > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config > >> version) > >> > 2012-07-26 22:26:20,849 [myid:8] - INFO > >> > [WorkerReceiver[myid=8]:FastLeaderElection@635] - Notification: 8 > >> > (n.leader), 0x3000001e3 (n.zxid), 0x1 (n.round), LOOKING (n.state), 9 > >> > (n.sid), 0x3 (n.peerEPoch), LEADING (my state)300000147 (n.config > >> version) > >> > 2012-07-26 22:26:21,083 [myid:8] - WARN [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :QuorumPeer@949] - Unexpected exception > >> > java.lang.InterruptedException: *Timeout while waiting for epoch from > >> quorum > >> > * > >> > at > >> > > >> > org.apache.zookeeper.server.quorum.Leader.getEpochToPropose(Leader.java:1207) > >> > at > >> org.apache.zookeeper.server.quorum.Leader.lead(Leader.java:464) > >> > at > >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:946) > >> > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :Leader@614] - Shutting down > >> > 2012-07-26 22:26:21,083 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :Leader@620] - Shutdown called > >> > java.lang.Exception: shutdown Leader! reason: Forcing shutdown > >> > at > >> > org.apache.zookeeper.server.quorum.Leader.shutdown(Leader.java:620) > >> > at > >> > org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:952) > >> > 2012-07-26 22:26:21,084 [myid:8] - INFO [QuorumPeer[myid=8]/ > >> 10.10.5.27:2181 > >> > :ZooKeeperServer@413] - shutting down > >> > 2012-07-26 22:26:21,084 [myid:8] - INFO > >> > [LearnerCnxAcceptor-0.0.0.0/0.0.0.0:2182 > :Leader$LearnerCnxAcceptor@407] > >> - > >> > exception while shutting down acceptor: java.net.SocketException: > Socket > >> > closed > >> > > >> > I am not sure what to make of it or how to debug from here. Any
+
David Nickerson 2012-07-27, 17:01
|
|