-
Possibility / consequences of having multiple elected leaders
Scott Lindner 2012-03-07, 17:48
While debugging a ZooKeeper cluster failure (separate thread), we ran across something that may have caused one of our servers to fall well behind the other two.

We are running a cluster of 3 ZooKeeper servers in our development cluster on Windows. These are not running as services and are just started from the command prompt. Because of this, it's possible that one of the servers had its command output frozen by someone clicking / marking in the window. We saw this happen accidentally while debugging, and the end result is that all requests to that server back up until they either time out or the command prompt is unfrozen.

This got us wondering what would happen if the elected leader were "frozen" in this manner. There are no guarantees about where in the code it would be hung, so we can't know for certain what would happen when it left this state, but could there be any problems where the "frozen" server comes out of this state still thinking it is the leader (since it was stuck) when in fact another server has been elected in the meantime? I would imagine this should resolve itself fairly quickly, but is there still a possibility that this could lead to bad behavior? Typically if a server fails I would imagine the ZooKeeper instance would die or lose leadership because of an event (failed connection, etc.), but this seems slightly different since the code would be blocked in a random state.

This seems to be more of a Windows issue, given how its command prompts work compared with other operating systems. We're going to avoid it by either installing a service that is responsible for starting the ZooKeeper servers or piping the output to a file that we can tail.

Thanks,
-Scott
-
Re: Possibility / consequences of having multiple elected leaders
Ted Dunning 2012-03-07, 18:59
This can be emulated on Linux by simply pausing the process.
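For anyone who wants to try the pause on Linux, here is a minimal sketch of the mechanism. It assumes a POSIX system and uses a throwaway sleeping child process as a stand-in; it is not a ZooKeeper server, and the helper name `pause_and_resume` is made up for illustration. The child is frozen with `SIGSTOP` and thawed with `SIGCONT`.

```python
import os
import signal
import subprocess
import sys
import time


def pause_and_resume(pause_seconds: float = 0.5) -> bool:
    """Freeze a child process with SIGSTOP, then resume it with SIGCONT.

    Returns True if the child was observed in the stopped state.
    The child is a stand-in sleeper, not a ZooKeeper server.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(30)"]
    )
    try:
        os.kill(child.pid, signal.SIGSTOP)   # freeze the process
        # WUNTRACED makes waitpid report a stopped child, not only an exited one.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)
        time.sleep(pause_seconds)            # the "frozen" window
        os.kill(child.pid, signal.SIGCONT)   # thaw the process
        return was_stopped
    finally:
        # Clean up the stand-in child regardless of what happened above.
        child.terminate()
        child.wait()
```

From a shell, the equivalent for a real server process would be `kill -STOP <pid>` followed later by `kill -CONT <pid>`. A pause longer than the ensemble's timeouts is what should trigger a new election.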
The correct behavior is that the old leader will freeze, and if it comes back relatively soon, it will still be recognized as leader.

If the pause is long enough, the other members of the quorum will decide that they have lost contact with the leader and initiate a new leader election. That election will cause the epoch to be incremented. When the old leader returns, it may attempt to commit a change. Such a commit will be rejected due to an old epoch. Alternatively, it will get a ping or a commit from the other servers, realize that it is behind, and initiate a resynchronization. Even if the old leader had started a commit before being paused, that commit will either have become durable or not. Neither case will cause any discrepancies, since the leader election will cause the remaining quorum to agree on a correct state.

In any case, the paused server should either survive as leader with the assent of a quorum, or it should realize it is no longer the leader and transparently update itself to the current state of the quorum.
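The epoch argument above can be sketched as a toy model. This is only an illustration of the described behavior, not ZooKeeper's implementation: the `Follower` class, its method names, and the `demo` function are all hypothetical, and whether the real code performs a per-commit epoch check is exactly what gets asked later in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    """Toy follower that remembers the highest epoch it has accepted."""
    accepted_epoch: int = 0
    log: list = field(default_factory=list)

    def on_new_leader(self, epoch: int) -> bool:
        """Adopt a leader only if its epoch is newer than any seen so far."""
        if epoch <= self.accepted_epoch:
            return False
        self.accepted_epoch = epoch
        return True

    def on_proposal(self, epoch: int, value: str) -> bool:
        """Accept a proposal only if it carries the currently accepted epoch."""
        if epoch != self.accepted_epoch:
            return False
        self.log.append((epoch, value))
        return True


def demo() -> list:
    followers = [Follower(), Follower()]
    for f in followers:
        f.on_new_leader(1)          # original leader, epoch 1
    # The epoch-1 leader is paused; the remaining quorum elects epoch 2.
    for f in followers:
        f.on_new_leader(2)
    # The old leader wakes up and proposes with its stale epoch.
    stale = [f.on_proposal(1, "x=old") for f in followers]
    fresh = [f.on_proposal(2, "x=new") for f in followers]
    return stale + fresh
```

In this model the stale proposals are refused by every follower that moved to epoch 2, so the old leader can never gather a quorum for them.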
-
RE: Possibility / consequences of having multiple elected leaders
Alexander Shraer 2012-03-08, 00:07
> Such a commit will be rejected due to an old epoch.
Ted, can you please point me to the place in the code where this check is performed?

Thanks a lot,
Alex
-
Re: Possibility / consequences of having multiple elected leaders
Ted Dunning 2012-03-08, 00:55
Not off the cuff and I have to run away right now.
-
RE: Possibility / consequences of having multiple elected leaders
Alexander Shraer 2012-03-08, 03:08
I’ve been wondering about this for a while, and suspect that this check doesn’t exist in the code… but I may be wrong.
-
Re: Possibility / consequences of having multiple elected leaders
Ted Dunning 2012-03-08, 08:31
The whole point of the Zab protocol is to ensure that only one elected leader can exist at any one time. Since a quorum has to commit to supporting any leader, there can't be two leaders. Furthermore, each change of leadership increments the epoch, and that increment has to be committed on a majority of nodes. That means that only one leader can exist in the latest epoch. Since the latest epoch is, by definition, acknowledged by a majority of nodes, an old leader cannot resurface as a pretender to the throne.
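The "a quorum has to commit" step rests on a counting fact: any two strict majorities of the same ensemble share at least one node, so two would-be leaders cannot both collect a majority from nodes that each back only one leader per epoch. A small brute-force check of that fact, with made-up node names and helper functions that are not part of ZooKeeper:

```python
from itertools import combinations


def majorities(nodes):
    """Yield every subset of `nodes` that forms a strict majority."""
    need = len(nodes) // 2 + 1
    for size in range(need, len(nodes) + 1):
        for subset in combinations(nodes, size):
            yield frozenset(subset)


def all_majorities_intersect(nodes) -> bool:
    """Check that every pair of majority quorums has a node in common."""
    quorums = list(majorities(nodes))
    return all(a & b for a, b in combinations(quorums, 2))
```

For a 3-server ensemble such as the one in the original question, `all_majorities_intersect(["zk1", "zk2", "zk3"])` holds, and the same is true for any ensemble size, since two sets that each contain more than half the nodes must overlap.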
-
RE: Possibility / consequences of having multiple elected leaders
Alexander Shraer 2012-03-08, 23:09
Thanks Ted, I can see your point. We use TCP connections and we do the epoch check at the beginning of the protocol, so a message from an old leader cannot just resurface.

Alex
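One way to picture the point about TCP and a single up-front epoch check is the toy model below. It is a sketch under assumptions, not ZooKeeper's code: `FollowerEndpoint`, `handshake`, and `deliver` are hypothetical names, and a connection is represented by a plain integer id. The epoch is compared once when a leader connection is set up, and afterwards messages are accepted only from the one live connection.

```python
class FollowerEndpoint:
    """Toy follower: the epoch is checked once per connection, not per message."""

    def __init__(self):
        self.accepted_epoch = 0
        self.current_conn = None   # id of the single live leader connection
        self._next_conn = 0
        self.delivered = []

    def handshake(self, leader_epoch: int):
        """Open a leader connection; refuse leaders with an older epoch."""
        if leader_epoch < self.accepted_epoch:
            return None
        self.accepted_epoch = leader_epoch
        self._next_conn += 1
        # Adopting a new connection implicitly abandons the previous one.
        self.current_conn = self._next_conn
        return self.current_conn

    def deliver(self, conn, message: str) -> bool:
        """Messages count only if they arrive over the live connection."""
        if conn is None or conn != self.current_conn:
            return False
        self.delivered.append(message)
        return True
```

In this model, once the follower has completed a handshake with the epoch-2 leader, anything the paused epoch-1 leader sends on its old connection is dropped, and a fresh handshake from it is refused because its epoch is older.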
-
Re: Possibility / consequences of having multiple elected leaders
Ted Dunning 2012-03-08, 23:12
Exactly.