I have a 3-node ZooKeeper 126.96.36.1998648 quorum in my setup.
We came across a situation where one of the ZooKeeper servers in the
cluster went down due to a bad disk.
We observed that leader election keeps running in a loop (it starts,
completes, and then starts again). The loop repeats every couple of minutes.
Even restarting the ZooKeeper server on both remaining nodes doesn't help
in recovering from this situation.
The network connection looks fine though, as I could telnet to the leader
election port and ssh from one node to the other.
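
The reachability check described above can be sketched in a few lines of
Python instead of telnet. This is only an illustration: the hostnames in
the usage comment are hypothetical, and the port numbers (2181 client,
2888 quorum, 3888 leader election) are the conventional ones from the
sample configuration and may differ from what zoo.cfg actually uses.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures
        return False


# Example usage (hypothetical hostnames, assumed conventional ports):
#   for host in ("zk-node1", "zk-node2"):
#       for port in (2181, 2888, 3888):
#           print(host, port, port_open(host, port))
```

Note that a successful TCP connect only shows that something is listening;
it says nothing about whether the quorum protocol itself is healthy. Also,
the quorum port is normally only open on the current leader.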
The ZooKeeper client on each node uses "127.0.0.1:2181" as the connect
string for connecting to the server, so if the local ZooKeeper server is
down, the client app is dead.
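
For context, a ZooKeeper connect string is a comma-separated list of
host:port pairs, so a client is not limited to a single server. The sketch
below shows how such a string is put together; the hostnames zk1, zk2 and
zk3 are hypothetical and this is not what we currently run.

```python
def build_connect_string(hosts, port=2181):
    """Join hosts into a ZooKeeper-style connect string: "h1:port,h2:port,..."."""
    return ",".join(f"{host}:{port}" for host in hosts)


# Current setup: each client talks only to its local server
local_only = build_connect_string(["127.0.0.1"])

# Hypothetical alternative: list every quorum member so the client
# can fail over when the local server is down
all_members = build_connect_string(["zk1", "zk2", "zk3"])

print(local_only)
print(all_members)
```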
I have uploaded zookeeper.log for both nodes at the following link:
Any idea what might be wrong with the quorum? Please note that restarting
the ZooKeeper server on both nodes doesn't help to recover from this
situation.
Thanks & Regards,