Marshall McMullen 2011-12-20, 17:00
Zookeeper devs,
I've got a cluster with 3 servers in the ensemble all running 3.4.0. After a few days of successful operation, we observed all zookeeper reads and writes began failing every time. In our log files, the error being reported is INVALID_STATE. I then telnetted to port 2181 on all three servers and was surprised to see that *two* of these servers both report they are the leader! Two of the nodes are in agreement on the Zxid, and one of the nodes is way out of whack with a much much larger Zxid. The node that all writes are flowing through is the one with the much higher Zxid.
Has anyone ever seen this before? What can I do to diagnose this problem and resolve it? I was considering killing zookeeper on the node that should not be the leader (the one with the wrong Zxid) and removing the zookeeper data directory, then restarting zookeeper on that node. Any other ideas?
I appreciate any help.
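The manual telnet check described above can be scripted. The following is a minimal sketch, assuming each server answers the `stat` four-letter command on its client port with plain-text lines that include `Mode:` and `Zxid:`; the function names are hypothetical and not part of ZooKeeper.

```python
import socket


def four_letter_word(host, port=2181, cmd=b"stat", timeout=5.0):
    """Send a four-letter command and return the server's full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode_and_zxid(reply):
    """Extract the 'Mode:' and 'Zxid:' fields from a stat reply (None if absent)."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in ("Mode", "Zxid"):
            fields[key.strip()] = value.strip()
    return fields.get("Mode"), fields.get("Zxid")


def find_leaders(replies):
    """Given {host: stat reply}, return the sorted hosts that report Mode: leader."""
    return sorted(
        host for host, reply in replies.items()
        if parse_mode_and_zxid(reply)[0] == "leader"
    )


def collect_replies(hosts, port=2181):
    """Query every host in turn; this performs real network I/O."""
    return {host: four_letter_word(host, port) for host in hosts}
```

Calling `find_leaders(collect_replies([...]))` with the ensemble's addresses and getting back more than one host would correspond to the symptom reported here.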
Patrick Hunt 2011-12-20, 17:37
The logs should have details on what happened. If you can provide them from around the time this occurred it would likely provide insight.
Note that 3.4.0 has a serious problem wrt cluster consistency, but I don't see how this would result in two leaders being elected.
Patrick
Benjamin Reed 2011-12-20, 18:13
i've seen it before when the configuration files haven't been setup properly. i would check the configuration. if the leader is still the leader, it must have active followers connected to it, otherwise it would give up leadership. i would use netstat to find out who they are.
ben
Patrick Hunt 2011-12-20, 18:17
Really the logs are critical here. If you can provide them it would shed light.
Patrick
Mahadev Konar 2011-12-20, 19:14
Agree with Pat. We should dig into this ASAP.
Marshall, mind opening a jira and posting the logs to it?
thanks mahadev
Marshall McMullen 2011-12-20, 19:21
What specific log files should I look for?
I inspected the config files for all 3 nodes and they *are* different. Specifically, the servers specified are not consistent:
$ cat /data/zookeeper/10.10.5.56/10.10.5.56_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.56/
maxClientCnxns=1000
clientPortAddress=10.10.5.56
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183

$ cat /data/zookeeper/10.10.5.58/10.10.5.58_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.58/
maxClientCnxns=1000
clientPortAddress=10.10.5.58
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.56:2182:2183
server.3=10.10.5.58:2182:2183

$ cat /data/zookeeper/10.10.5.46/10.10.5.46_2181.cfg
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/data/zookeeper/10.10.5.46/
maxClientCnxns=1000
clientPortAddress=10.10.5.46
clientPort=2181
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183
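The disagreement between these three files can be found mechanically. A minimal sketch follows, assuming each file uses plain `server.N=host:port:port` lines as shown; the `server.` entries are copied from the listings above, while the function names and the `CONFIGS` constant are hypothetical.

```python
# server.N lines copied from the three config files in this message,
# keyed by the node whose file they came from.
CONFIGS = {
    "10.10.5.56": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183
""",
    "10.10.5.58": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.56:2182:2183
server.3=10.10.5.58:2182:2183
""",
    "10.10.5.46": """
server.1=10.10.5.46:2182:2183
server.2=10.10.5.35:2182:2183
server.3=10.10.5.56:2182:2183
""",
}


def server_entries(cfg_text):
    """Return {server_id: address} for every 'server.N=...' line in a config."""
    entries = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("server.") and "=" in line:
            key, value = line.split("=", 1)
            # Assumes N is a plain integer, as in the files above.
            entries[int(key.strip()[len("server."):])] = value.strip()
    return entries


def inconsistent_ids(configs):
    """Return {server_id: {node: address}} for each id the nodes disagree on."""
    per_node = {node: server_entries(text) for node, text in configs.items()}
    all_ids = set().union(*(entries.keys() for entries in per_node.values()))
    diffs = {}
    for sid in sorted(all_ids):
        seen = {node: entries.get(sid) for node, entries in per_node.items()}
        if len(set(seen.values())) > 1:
            diffs[sid] = seen
    return diffs


for sid, seen in inconsistent_ids(CONFIGS).items():
    print(f"server.{sid} differs: {seen}")
```

On the files above, `server.1` is the same everywhere, while `server.2` and `server.3` differ on 10.10.5.58, and 10.10.5.58 does not appear at all in the other two nodes' server lists.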
So this looks like a configuration problem, not a zookeeper bug, correct?
Ted Dunning 2011-12-20, 19:32
On Tue, Dec 20, 2011 at 11:21 AM, Marshall McMullen <[EMAIL PROTECTED]> wrote:
> So this looks like a configuration problem not a zookeeper bug correct?

Well, it isn't good!
If the problem goes away when the config is fixed, then your conclusion is accurate.
Marshall McMullen 2011-12-20, 20:24
After fixing the config files and restarting everything, all the problems went away. Definitely looks like a configuration problem.
Thanks everyone for helping me to diagnose this.
Benjamin Reed 2011-12-20, 21:44
there is a jira for an improvement that would add to the protocol checks to make sure that the server configurations are compatible and make sure that client configurations are compatible. i can't seem to find them though...
ben
Marshall McMullen 2011-12-20, 22:40
Something like that would be really handy to have. I was thinking of how I might add something like that to our code sitting on top of zookeeper. Essentially comparing the zookeeper config file on each node to what we expect it to be to ensure everything is consistent. But something built into zookeeper would be much better IMO.
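An application-side check of the kind described here might look like the sketch below. It assumes the application already knows the intended ensemble membership and can read each node's config file as text; `membership_mismatches` and the `expected` mapping are hypothetical names, not ZooKeeper features.

```python
def membership_mismatches(cfg_text, expected):
    """Compare a config's server.N lines with the expected membership.

    `expected` maps keys such as "server.1" to "host:port:port" strings.
    Returns {key: (expected_value, actual_value)} for every key that is
    missing, unexpected, or different; an empty dict means the file matches.
    """
    actual = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if line.startswith("server.") and "=" in line:
            key, value = line.split("=", 1)
            actual[key.strip()] = value.strip()

    problems = {}
    for key in sorted(set(expected) | set(actual)):
        if expected.get(key) != actual.get(key):
            problems[key] = (expected.get(key), actual.get(key))
    return problems
```

A node whose file yields a non-empty result could refuse to start its ZooKeeper server or raise an alert, so the mismatch is caught before the ensemble runs with it.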
Benjamin Reed 2011-12-20, 19:35
yes this is a configuration problem. 10.10.5.35 must be running as well right?
ben