-
server cannot join quorum
Alexis Midon 2011-01-07, 00:32
Hi there,
I have a cluster of 3 machines running ZooKeeper 3.3.1. zk1 fails to join the quorum while zk2 and zk3 interact correctly. zk1 is stuck in the election loop. See the log below. I checked the config files and the connectivity between the machines, and I can't find anything wrong.
Any ideas?
thanks in advance,
alexis
2011-01-07 00:14:23,156 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumPeer@551] - Initializing leader election protocol...
2011-01-07 00:14:23,157 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@649] - New election. My id = 1, Proposed zxid = 0
2011-01-07 00:14:23,158 - DEBUG [WorkerSender Thread:QuorumCnxManager@346] - Opening channel to server 2
2011-01-07 00:14:23,159 - DEBUG [WorkerReceiver Thread:FastLeaderElection$Messenger$WorkerReceiver@214] - Receive new notification message. My id = 1
2011-01-07 00:14:23,160 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@689] - Notification: 1, 0, 1, 1, LOOKING, LOOKING, 1
2011-01-07 00:14:23,160 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@495] - id: 1, proposed id: 1, zxid: 0, proposed zxid: 0
2011-01-07 00:14:23,161 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@717] - Adding vote: From = 1, Proposed leader = 1, Porposed zxid = 0, Proposed epoch = 1
2011-01-07 00:14:23,162 - INFO [WorkerSender Thread:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (2, 1)
2011-01-07 00:14:23,162 - DEBUG [WorkerSender Thread:QuorumCnxManager@346] - Opening channel to server 3
2011-01-07 00:14:23,172 - INFO [WorkerSender Thread:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (3, 1)
2011-01-07 00:14:23,365 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@391] - Queue size: 1
2011-01-07 00:14:23,366 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@391] - Queue size: 1
2011-01-07 00:14:23,366 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@346] - Opening channel to server 2
2011-01-07 00:14:23,367 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (2, 1)
2011-01-07 00:14:23,367 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@346] - Opening channel to server 3
2011-01-07 00:14:23,378 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (3, 1)
2011-01-07 00:14:23,378 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FastLeaderElection@683] - Notification time out: 400
2011-01-07 00:14:23,785 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@391] - Queue size: 1
2011-01-07 00:14:23,785 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@391] - Queue size: 1
2011-01-07 00:14:23,786 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@346] - Opening channel to server 2
2011-01-07 00:14:26,786 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (2, 1)
...
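When a log like the one above keeps repeating, it can help to tally how often each peer connection is dropped. The following is a minimal sketch, not part of ZooKeeper: the function name `count_dropped_connections` is hypothetical, the regular expression only targets the log layout shown in this thread, and it assumes the first number in "(2, 1)" is the remote server id and the second is the local id (which matches "My id = 1" in the log).

```python
import re
from collections import Counter

# Layout seen in this thread:
#   <date> <time>,<ms> - <LEVEL> [<thread/class>@<line>] - <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) - "
    r"(?P<level>[A-Z]+)\s+\[(?P<source>[^\]]+)\] - (?P<msg>.*)$"
)
# Assumption: "(remote id, local id)" as suggested by "My id = 1" in the log.
DROP_RE = re.compile(r"dropping the connection: \((\d+), (\d+)\)")


def count_dropped_connections(log_text):
    """Return a Counter mapping remote server id -> number of dropped connections."""
    drops = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that are not in the expected layout
        dropped = DROP_RE.search(match.group("msg"))
        if dropped:
            drops[int(dropped.group(1))] += 1
    return drops


SAMPLE_LOG = """\
2011-01-07 00:14:23,162 - INFO [WorkerSender Thread:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (2, 1)
2011-01-07 00:14:23,162 - DEBUG [WorkerSender Thread:QuorumCnxManager@346] - Opening channel to server 3
2011-01-07 00:14:23,172 - INFO [WorkerSender Thread:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (3, 1)
2011-01-07 00:14:23,367 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@162] - Have smaller server identifier, so dropping the connection: (2, 1)
"""

print(count_dropped_connections(SAMPLE_LOG))
```

A steadily growing count for every peer, with no later line showing that a leader was found, is what "stuck in the election loop" looks like in this kind of log.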
-
Re: server cannot join quorum
Vishal Kher 2011-01-07, 05:04
Hi,
Can you attach the zoo.cfg files and logs from all the nodes? It might also be worth verifying that zk2 and zk3 are able to talk to zk1 (i.e., that there are no firewall/IP/networking issues).
-
Re: server cannot join quorum
Ted Dunning 2011-01-07, 05:17
When you checked this, did you actually connect to the peer ports from the different machines, or just ping from machine to machine?
On Thu, Jan 6, 2011 at 4:32 PM, Alexis Midon <[EMAIL PROTECTED]> wrote:
> the connectivity between the machines
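The difference between pinging a machine and connecting to a peer port can be sketched as below. This is only an illustration: `can_connect` and `check_peer_ports` are hypothetical helper names, the default ports 2888 and 3888 are taken from the `server.N` lines quoted later in this thread, and the check has to be run from each machine in turn to cover every direction.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


def check_peer_ports(hosts, ports=(2888, 3888)):
    """Return {(host, port): reachable} for every host/port combination."""
    return {(host, port): can_connect(host, port) for host in hosts for port in ports}
```

For example, `check_peer_ports(["zk1", "zk2", "zk3"])` run on each of the three machines would give the full connectivity matrix. A successful ping says nothing about whether these TCP ports are open or filtered.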
-
Re: server cannot join quorum
Flavio Junqueira 2011-01-07, 15:50
Hi Alexis, do you think you can try with 3.3.2? If the problem is due to leader election, then we might have fixed it already.
-Flavio
flavio junqueira
research scientist
[EMAIL PROTECTED] direct +34 93-183-8828
avinguda diagonal 177, 8th floor, barcelona, 08018, es phone (408) 349 3300 fax (408) 349 3301
-
Re: server cannot join quorum
Alexis Midon 2011-01-07, 18:32
Flavio, I guess you're referring to ZK-822 & 790? We're actually upgrading the environment right now.
More details are below, and the logs are attached. I didn't take the 'connection refused' on 2888 as an error, since - afaik - followers do not always open this port. I also double-checked the security group settings with my sys admin.
## zoo.cfg ########################
tickTime=2000
initLimit=12
syncLimit=5

dataDir=/var/zk/data
dataLogDir=/var/zk/txlog
clientPort=2181

#minSessionTimeout=10000
maxSessionTimeout=900000

server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
########################
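To compare the zoo.cfg files across nodes, a small parser can help. The following is a rough sketch under stated assumptions: `parse_zoo_cfg` and `limits_in_ms` are hypothetical names, only the simple `key=value` and `server.N=host:port:port` forms shown above are handled, and `initLimit` and `syncLimit` are interpreted as multiples of `tickTime` (milliseconds).

```python
def parse_zoo_cfg(text):
    """Split a zoo.cfg into plain settings and a {server id: (host, port, port)} map."""
    settings, servers = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key.startswith("server."):
            sid = int(key.split(".", 1)[1])
            host, first_port, second_port = value.split(":")[:3]
            servers[sid] = (host, int(first_port), int(second_port))
        else:
            settings[key] = value
    return settings, servers


def limits_in_ms(settings):
    """Convert initLimit and syncLimit from ticks to milliseconds."""
    tick = int(settings["tickTime"])
    return int(settings["initLimit"]) * tick, int(settings["syncLimit"]) * tick


SAMPLE_CFG = """\
tickTime=2000
initLimit=12
syncLimit=5
clientPort=2181
#minSessionTimeout=10000
maxSessionTimeout=900000
server.1=zk1:2888:3888
server.2=zk2:2888:3888
server.3=zk3:2888:3888
"""

settings, servers = parse_zoo_cfg(SAMPLE_CFG)
print(servers)
print(limits_in_ms(settings))
```

With the values above, the init limit works out to 12 x 2000 ms = 24 s and the sync limit to 5 x 2000 ms = 10 s. Running the parser on each node's file and comparing the `servers` maps is a quick way to rule out a mismatched `server.N` line.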
> for f in {1..3}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "echo srvr | nc 127.0.0.1 2181"; done
zk1 ---------
This ZooKeeper instance is not currently serving requests
zk2 ---------
Zookeeper version: 3.3.1-942149, built on 05/07/2010 17:14 GMT
Latency min/avg/max: 0/3/338
Received: 1738136
Sent: 1759599
Outstanding: 0
Zxid: 0x1000de84a
Mode: follower
Node count: 1041
zk3 ---------
Zookeeper version: 3.3.1-942149, built on 05/07/2010 17:14 GMT
Latency min/avg/max: 0/1/257
Received: 338889
Sent: 367196
Outstanding: 0
Zxid: 0x1000de84a
Mode: leader
Node count: 1041
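The `srvr` output above is easy to turn into a dictionary for comparison across nodes. This is a minimal sketch: `parse_srvr` and `split_zxid` are hypothetical helper names, and the epoch/counter split assumes the usual zxid layout of a 32-bit epoch in the high bits and a 32-bit counter in the low bits.

```python
def parse_srvr(output):
    """Parse 'Key: value' lines from srvr output; other lines are ignored."""
    info = {}
    for line in output.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            info[key.strip()] = value.strip()
    return info


def split_zxid(zxid_hex):
    """Split a hex zxid string into (epoch, counter)."""
    zxid = int(zxid_hex, 16)
    return zxid >> 32, zxid & 0xFFFFFFFF


ZK3_OUTPUT = """\
Zookeeper version: 3.3.1-942149, built on 05/07/2010 17:14 GMT
Latency min/avg/max: 0/1/257
Received: 338889
Sent: 367196
Outstanding: 0
Zxid: 0x1000de84a
Mode: leader
Node count: 1041
"""
ZK1_OUTPUT = "This ZooKeeper instance is not currently serving requests\n"

zk3 = parse_srvr(ZK3_OUTPUT)
print(zk3["Mode"], split_zxid(zk3["Zxid"]))
print(parse_srvr(ZK1_OUTPUT).get("Mode"))
```

A node that is not serving requests yields no `Mode` key at all, which is how zk1 shows up here.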
## ZK 1,2 to ZK 3 ###############
> for f in {1..2}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk3.prod2.i.c3-e.com 2888"; done
zk1 ---------
Trying 10.96.42.54...
Connected to ec2-75-101-171-125.compute-1.amazonaws.com.
Escape character is '^]'.
^Czk2 ---------
Trying 10.96.42.54...
Connected to ec2-75-101-171-125.compute-1.amazonaws.com.
Escape character is '^]'.
> for f in {1..2}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk3.prod2.i.c3-e.com 3888"; done
zk1 ---------
Trying 10.96.42.54...
Connected to ec2-75-101-171-125.compute-1.amazonaws.com.
Escape character is '^]'.
^Czk2 ---------
Trying 10.96.42.54...
Connected to ec2-75-101-171-125.compute-1.amazonaws.com.
Escape character is '^]'.

## ZK 2,3 to ZK 1 ###############
> for f in {2..3}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk1.prod2.i.c3-e.com 2888"; done
zk2 ---------
Trying 10.196.155.208...
telnet: Unable to connect to remote host: Connection refused
zk3 ---------
Trying 10.196.155.208...
telnet: Unable to connect to remote host: Connection refused
> for f in {2..3}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk1.prod2.i.c3-e.com 3888"; done
zk2 ---------
Trying 10.196.155.208...
Connected to ec2-174-129-156-215.compute-1.amazonaws.com.
Escape character is '^]'.
^Czk3 ---------
Trying 10.196.155.208...
Connected to ec2-174-129-156-215.compute-1.amazonaws.com.
Escape character is '^]'.

## ZK 1,3 to ZK 2 ###############
> for f in {1,3}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk2.prod2.i.c3-e.com 2888"; done
zk1 ---------
Trying 10.97.29.58...
telnet: Unable to connect to remote host: Connection refused
zk3 ---------
Trying 10.97.29.58...
telnet: Unable to connect to remote host: Connection refused
> for f in {1,3}; do echo "zk$f --------- "; ssh zk$f.prod2.i.c3-e.com "telnet zk2.prod2.i.c3-e.com 3888"; done
zk1 ---------
Trying 10.97.29.58...
Connected to ec2-50-16-119-92.compute-1.amazonaws.com.
Escape character is '^]'.
^Czk3 ---------
Trying 10.97.29.58...
Connected to ec2-50-16-119-92.compute-1.amazonaws.com.
Escape character is '^]'.
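The telnet results can be checked against the expectation stated earlier in this message, namely that port 2888 is only open on the current leader while port 3888 is open on every server. The sketch below encodes that assumption; `unexpected_results` is a hypothetical helper, and the `OBSERVED` table simply transcribes the outcomes above by target host.

```python
# Outcomes of the telnet tests above, keyed by (target host, port).
OBSERVED = {
    ("zk1", 2888): False, ("zk1", 3888): True,
    ("zk2", 2888): False, ("zk2", 3888): True,
    ("zk3", 2888): True,  ("zk3", 3888): True,
}


def unexpected_results(observed, leader, quorum_port=2888, election_port=3888):
    """List (host, port, reachable) entries that contradict the assumed pattern.

    Assumption: the election port is open on every server, and the quorum
    port is open only on the leader.
    """
    problems = []
    for (host, port), reachable in sorted(observed.items()):
        if port == election_port:
            expected = True
        elif port == quorum_port:
            expected = host == leader
        else:
            continue  # ports outside the two peer ports are not judged
        if reachable != expected:
            problems.append((host, port, reachable))
    return problems


print(unexpected_results(OBSERVED, leader="zk3"))
```

With zk3 as the leader (per the `srvr` output), the list is empty, so under this assumption the refused connections on 2888 do not by themselves point to a network problem.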
On Thu, Jan 6, 2011 at 9:17 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> When you checked this did you actually connect to the peer ports from the
> different machines? Or just ping from machine to machine?
>
> On Thu, Jan 6, 2011 at 4:32 PM, Alexis Midon <[EMAIL PROTECTED]> wrote:
>
> > the connectivity between the machines