I've got a cluster with three servers in the ensemble, all running ZooKeeper
3.4.0. After a few days of successful operation, all ZooKeeper reads and
writes began failing every time. The error reported in our log files is
INVALID_STATE. I then telnetted to port 2181 on all three servers and was
surprised to see that *two* of them report that they are the leader! Two of
the nodes agree on the Zxid, and the third is way off, with a much larger
Zxid. The node that all writes are flowing through is the one with the much
higher Zxid.
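
For anyone who wants to repeat the telnet check in a script, here is a rough sketch rather than a definitive tool. It assumes the servers answer the `srvr` four-letter command on the client port and that the reply contains `Mode:` and `Zxid:` lines. The function names are made up for illustration. It also splits each Zxid into its epoch (high 32 bits) and counter (low 32 bits), which may help show whether the odd node is in a different epoch.

```python
import socket


def four_letter_word(host, port=2181, cmd=b"srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_status(text):
    """Return (mode, zxid) from the 'Key: value' lines of a reply."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields.get("Mode"), fields.get("Zxid")


def split_zxid(zxid_hex):
    """Split a hex Zxid into (epoch, counter): high and low 32 bits."""
    zxid = int(zxid_hex, 16)
    return zxid >> 32, zxid & 0xFFFFFFFF


def report(hosts, port=2181):
    """Print mode, Zxid, epoch and counter for each host (hosts are caller-supplied)."""
    for host in hosts:
        try:
            mode, zxid = parse_status(four_letter_word(host, port))
        except OSError as exc:
            print(f"{host}: unreachable ({exc})")
            continue
        if zxid is None:
            print(f"{host}: mode={mode}, no Zxid in reply")
            continue
        epoch, counter = split_zxid(zxid)
        print(f"{host}: mode={mode} zxid={zxid} epoch={epoch} counter={counter}")
```

Calling `report(["zk1.example.com", "zk2.example.com", "zk3.example.com"])` (placeholder hostnames) would print one line per server, so two nodes claiming `leader` or a mismatched epoch shows up at a glance.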
Has anyone ever seen this before? What can I do to diagnose and resolve this
problem? I was considering killing ZooKeeper on the node that should not be
the leader (the one with the wrong Zxid), removing its ZooKeeper data
directory, and then restarting ZooKeeper on that node. Any other ideas?
I appreciate any help.
Patrick Hunt 2011-12-20, 17:37
Benjamin Reed 2011-12-20, 18:13
Patrick Hunt 2011-12-20, 18:17
Mahadev Konar 2011-12-20, 19:14
Marshall McMullen 2011-12-20, 19:21
Ted Dunning 2011-12-20, 19:32
Marshall McMullen 2011-12-20, 20:24
Benjamin Reed 2011-12-20, 21:44
Marshall McMullen 2011-12-20, 22:40
Benjamin Reed 2011-12-20, 19:35