-
Failure to rejoin ensemble after reboot
Marshall McMullen 2012-07-09, 04:44
I'm trying to get to the bottom of a problem we're seeing: after I forcibly reboot an ensemble node (running on Linux) via "reboot -f", it is unable to rejoin the ensemble and no clients can connect to it. Has anyone ever seen a problem like this before?

I have been investigating this under https://issues.apache.org/jira/browse/ZOOKEEPER-1453, as on the surface it looked like there was some sort of transaction/log corruption going on. But now I'm not so sure of that.

What bothers me the most right now is that I am unable to reliably get the node in question to rejoin the ensemble. I've removed the contents of the "version-2" directory and restarted zookeeper to no avail. It regenerates an epoch file but never obtains the new database from a peer. I even went so far as to copy the on-disk database from another node and restart zookeeper, and I still can't get it to rejoin the ensemble. I've also seen anomalous behavior where, once I got it into this failed state, I stopped all three zookeeper server processes entirely and then started them all back up; then everything connected and all three nodes were in the ensemble. But this really shouldn't be necessary.

None of this matches the behavior I expected. If anyone has any insight, it would be greatly appreciated.
-
Re: Failure to rejoin ensemble after reboot
Camille Fournier 2012-07-09, 14:09
That is very strange. What do the logs of the misbehaving server say? What do the logs of the other servers say? What does a stack dump of the misbehaving server look like?
Also, just to clarify, if you don't do anything but fully stop and restart the cluster (no deleting version-2 files etc) the whole ensemble will reform successfully?

C
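Alongside the logs and a stack dump, one cheap extra data point is what each server reports about its own role. The sketch below is only an illustration, not something used in this thread: it sends ZooKeeper's `srvr` four-letter command to each server and extracts the `Mode:` line. The client port 2181 and the host names in the usage note are assumptions. A server that is running but has not joined a quorum typically answers with a notice that it is not serving requests rather than a `Mode:` line, in which case `parse_mode` returns `None`.

```python
import socket


def four_letter_word(host, port, word="srvr", timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line, or None if the reply has none."""
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Mode":
            return value.strip()
    return None


def ensemble_modes(servers, port=2181):
    """Map each host to its reported mode, or to the error that prevented a reply."""
    modes = {}
    for host in servers:
        try:
            modes[host] = parse_mode(four_letter_word(host, port))
        except OSError as exc:  # refused, timed out, unresolvable, ...
            modes[host] = "unreachable: %s" % exc
    return modes
```

Calling `ensemble_modes(["zoo1", "zoo2", "zoo3"])` (hypothetical host names) on a healthy three-node ensemble should show one leader and two followers; a `None` or an "unreachable" entry points at the node to look at first.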
-
Re: Failure to rejoin ensemble after reboot
Marshall McMullen 2012-07-09, 14:14
As it turns out, it was a configuration problem. We use zookeeper in an embedded manner, so our application code creates the myid file programmatically when we start zookeeper. After the reboot, it was creating the 'myid' file and putting the wrong value in there: the ID of another ensemble node already in the cluster. I can't believe how much time was wasted on such a simple configuration problem. Given how fatal this was, it might have been useful if ZK could have detected multiple servers with the same ID and given a more helpful error message. But in any event, the problem is solved now. Thanks for taking the time to respond, Camille.
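Since the root cause was application code writing the wrong value into `myid`, a pre-start sanity check is one way to catch it early. The following is a minimal sketch under stated assumptions, not the code used in this thread: it assumes the ensemble is described by the usual `server.N=host:port:port` lines, that hosts are plain names or IPv4 addresses, and that the caller can supply the set of names this machine answers to. The function names `parse_servers` and `check_myid` are hypothetical.

```python
import re

# Matches lines such as "server.3=zoo3:2888:3888" and captures the id and host.
# Assumption: hosts are plain names or IPv4 addresses (no bracketed IPv6).
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*([^:\s]+)")


def parse_servers(cfg_text):
    """Return {server_id: host} from the server.N lines of a configuration."""
    servers = {}
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def check_myid(myid_text, cfg_text, local_hosts):
    """Return the id in myid_text, or raise ValueError explaining the mismatch."""
    try:
        my_id = int(myid_text.strip())
    except ValueError:
        raise ValueError("myid must contain a single integer, got %r" % myid_text) from None
    servers = parse_servers(cfg_text)
    if my_id not in servers:
        raise ValueError("myid %d has no server.%d entry in the configuration" % (my_id, my_id))
    host = servers[my_id]
    if host not in local_hosts:
        raise ValueError(
            "myid %d is assigned to %s, which is not one of this machine's names %s"
            % (my_id, host, sorted(local_hosts))
        )
    return my_id
```

Running `check_myid` on the value about to be written, before the embedded server starts, turns a silent duplicate ID into an immediate error that names the conflicting host.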
-
Re: Failure to rejoin ensemble after reboot
Camille Fournier 2012-07-09, 14:16
I was thinking the same thing when I answered that email earlier this week about the lack of myid causing an error that is difficult to trace. I kind of hate the myid file; why is it necessary in the first place? There must be a cleaner way for us to identify servers and avoid conflicts.

C
-
Re: Failure to rejoin ensemble after reboot
Marshall McMullen 2012-07-09, 14:19
I completely agree. This is certainly not the first time this has caused problems. At a minimum, a more helpful error message when there is a missing myid file or a collision in server IDs would have been a lifesaver.
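To make the kind of message being asked for concrete, here is a rough sketch of an out-of-band check that reports both failure cases by name. It is not ZooKeeper behavior: it assumes some deployment tool has already collected each host's `myid` value (or `None` when the file is absent), and the function name `find_id_problems` and the host names are hypothetical.

```python
from collections import defaultdict


def find_id_problems(reported_ids):
    """Describe missing myid files and server id collisions.

    reported_ids maps host -> myid value (int), or None if the file is missing.
    Returns a list of human-readable problems; an empty list means none found.
    """
    problems = []
    hosts_by_id = defaultdict(list)
    for host, server_id in sorted(reported_ids.items()):
        if server_id is None:
            problems.append("%s: myid file is missing" % host)
        else:
            hosts_by_id[server_id].append(host)
    for server_id, hosts in sorted(hosts_by_id.items()):
        if len(hosts) > 1:
            problems.append(
                "server id %d is claimed by more than one host: %s"
                % (server_id, ", ".join(hosts))
            )
    return problems
```

For example, `find_id_problems({"zoo1": 1, "zoo2": 2, "zoo3": 1})` returns a single entry stating that server id 1 is claimed by both zoo1 and zoo3, which is the situation described in this thread.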
-
Re: Failure to rejoin ensemble after reboot
Patrick Hunt 2012-07-09, 16:48
On Mon, Jul 9, 2012 at 7:16 AM, Camille Fournier <[EMAIL PROTECTED]> wrote:
> I was thinking the same thing when I answered that email earlier this week
> about the lack of myid causing an error that is difficult to trace. I kind
> of hate the myid file, why is it necessary in the first place? There must
> be a cleaner way for us to identify servers and avoid conflicts.

This question has come up in the past. IIRC, Ben/Flavio mentioned that they wanted to treat the data as a self-contained unit, i.e. the data is specific to a particular id. Having servers identify an id conflict seems reasonable; I'd suggest entering a JIRA for that.

Patrick