Marshall McMullen 2012-07-09, 04:44
Camille Fournier 2012-07-09, 14:09
Marshall McMullen 2012-07-09, 14:14
Camille Fournier 2012-07-09, 14:16
-Re: Failure to rejoin ensemble after reboot
Patrick Hunt 2012-07-09, 16:48
On Mon, Jul 9, 2012 at 7:16 AM, Camille Fournier <[EMAIL PROTECTED]> wrote:
> I was thinking the same thing when I answered that email earlier this week
> about the lack of myid causing an error that is difficult to trace. I kind
> of hate the myid file, why is it necessary in the first place? There must
> be a cleaner way for us to identify servers and avoid conflicts.
This question has come up in the past. IIRC Ben/Flavio mentioned that
they wanted to treat the data as a self-contained unit, i.e. the data is
specific to a particular id.
Having servers detect id conflicts seems reasonable; I'd suggest
entering a jira for that.
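Until the server itself reports such conflicts, an application that embeds
ZooKeeper and writes myid programmatically could sanity-check the value before
starting. The following is a minimal sketch, not part of ZooKeeper: the function
names (parse_servers, check_myid) and the local_hosts argument are hypothetical.
It assumes the usual "server.N=host:port:port" lines in the configuration and a
myid file holding a single integer. It would not handle IPv6 literals, and it
only catches a myid that points at another host's entry, not every possible
conflict.

```python
import re

# Matches "server.<id>=<host>:..." lines as used in zoo.cfg.
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*([^:\s]+)")


def parse_servers(cfg_text):
    """Return {server_id: host} from 'server.N=host:port:port' lines."""
    servers = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def check_myid(myid_text, cfg_text, local_hosts):
    """Return a list of problems; an empty list means myid looks consistent.

    local_hosts is the set of names/addresses this machine answers to
    (hypothetical input; the caller decides how to collect it).
    """
    try:
        my_id = int(myid_text.strip())
    except ValueError:
        return ["myid does not contain an integer: %r" % myid_text]

    servers = parse_servers(cfg_text)
    if my_id not in servers:
        return ["myid %d has no matching server.%d entry" % (my_id, my_id)]
    if servers[my_id] not in local_hosts:
        return ["myid %d is configured for host %s, not this machine"
                % (my_id, servers[my_id])]
    return []
```

For example, with entries for zk1, zk2 and zk3, calling check_myid("1", cfg, {"zk2"})
on the zk2 machine would report that id 1 belongs to zk1, which is the kind of
mistake described in the quoted message below.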
> On Mon, Jul 9, 2012 at 10:14 AM, Marshall McMullen <
> [EMAIL PROTECTED]> wrote:
>> As it turns out, it was a configuration problem. We use zookeeper in an
>> embedded manner so our application code creates the myid file
>> programmatically when we start zookeeper. After the reboot, it was creating
>> the 'myid' file and putting the wrong value in there: the ID of another
>> ensemble node already in the cluster. I can't believe how much time
>> was wasted on such a simple configuration problem. Given how fatal this
>> was, it might have been useful if ZK could have detected multiple servers
>> with the same ID and given a more helpful error message. But in any event,
>> problem is solved now.... thanks for taking the time to respond Camille.
>> On Mon, Jul 9, 2012 at 8:09 AM, Camille Fournier <[EMAIL PROTECTED]> wrote:
>> > That is very strange. What do the logs of the misbehaving server say? What
>> > do the logs of the other servers say? What does a stack dump of the
>> > misbehaving server look like?
>> > Also, just to clarify, if you don't do anything but fully stop and restart
>> > the cluster (no deleting version-2 files etc) the whole ensemble will
>> > reform successfully?
>> > C
>> > On Mon, Jul 9, 2012 at 12:44 AM, Marshall McMullen <
>> > [EMAIL PROTECTED]> wrote:
>> > > I'm trying to get to the bottom of a problem we're seeing where, after I
>> > > forcibly reboot an ensemble node (running on Linux) via "reboot -f", it is
>> > > unable to rejoin the ensemble and no clients can connect to it. Has anyone
>> > > ever seen a problem like this before?
>> > >
>> > > I have been investigating this under
>> > > https://issues.apache.org/jira/browse/ZOOKEEPER-1453 as on the surface it
>> > > looked like there was some sort of transaction/log corruption going on. But
>> > > now I'm not so sure of that.
>> > >
>> > > What bothers me the most right now is that I am unable to reliably get the
>> > > node in question to rejoin the ensemble. I've removed the contents of the
>> > > "version-2" directory and restarted zookeeper to no avail. It creates
>> > > an epoch file but never obtains the new database from a peer. I even went
>> > > so far as to copy the on-disk database from another node and restart
>> > > zookeeper and I still can't get it to rejoin the ensemble. I've also
>> > > seen anomalous behavior where, once I get it into this failed state, I just
>> > > stop all three zookeeper server processes entirely and then start them
>> > > back up... then everything connects and all three nodes are in the
>> > > ensemble. But this really shouldn't be necessary.
>> > >
>> > > None of this matches the behavior I expected. If anyone has any insight,
>> > > it would be greatly appreciated.
>> > >
Marshall McMullen 2012-07-09, 14:19