Re: Possible issue with cluster availability following new Leader Election - ZK 3.4
Camille Fournier 2012-05-17, 02:50
This pretty much matches what I expect. It would be great if you
wanted to try your hand at creating a patch and submitting it to the
ticket that was created for this problem, but if not, please post this
analysis to issue 1465 and we'll look at it ASAP.

C

On Wed, May 16, 2012 at 2:55 PM, Vinayak Khot <[EMAIL PROTECTED]> wrote:
> We have also encountered a problem where the newly elected leader sends
> the entire snapshot to a follower even though the follower is already in
> sync with the leader.
>
> A closer look at the code shows the problem is in the logic that decides
> whether to send a snapshot.
> The following scenario explains the problem in detail.
> Start a 3-node ZooKeeper ensemble where every quorum member has seen the
> same changes, up to zxid *0x400000004*.
>
> 1. When a newly elected leader starts, it bumps up its zxid to the new
> epoch.
>
> Code snippet Leader.java
>
> long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
> zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
> synchronized(this){
>     lastProposed = zk.getZxid();  // now 0x500000000
> }
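>
> For reference, a zxid packs the epoch into its high 32 bits and the counter
> into its low 32 bits, which is why the new leader's zxid jumps from
> 0x400000004 straight to 0x500000000. A minimal sketch of that packing,
> assuming ZxidUtils uses the usual (epoch << 32) | counter layout:
>
> // Sketch only: how a zxid is composed from epoch and counter.
> // epoch 0x4, counter 0x4 -> 0x400000004 (last committed zxid)
> // epoch 0x5, counter 0x0 -> 0x500000000 (new leader, nothing proposed yet)
> static long makeZxid(long epoch, long counter) {
>     return (epoch << 32L) | (counter & 0xffffffffL);
> }
>
> static long getEpochFromZxid(long zxid) {
>     return zxid >> 32L;
> }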
>
> 2. Now a follower tries to join the leader with its peerLastZxid =
> *0x400000004*.
>
> Note that the leader now has an in-memory committedLog list with
> maxCommittedLog = *0x400000004*.
>
> Since committedLog does not contain any new transactions with zxid >
> peerLastZxid, we check whether the leader and follower are in sync.
>
> Code snippet from LearnerHandler.java
> leaderLastZxid = leader.startForwarding(this, updates);
> if (peerLastZxid == leaderLastZxid) {   // 0x400000004 == 0x500000000 -> false
>   // We are in sync so we'll do an empty diff
>   packetToSend = Leader.DIFF;
>   zxidToSend = leaderLastZxid;
> }
>
> Note that *leader.startForwarding()* returns the *lastProposed* zxid,
> which the leader has already set to *0x500000000*.
> So in this scenario we never send an empty diff even though the leader and
> follower are in sync, and we end up sending the entire snapshot in the code
> that follows the above check.
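>
> Simplified, the effective decision in LearnerHandler looks roughly like the
> sketch below (not verbatim code; the defaults and ordering are paraphrased,
> and the log check is a hypothetical helper):
>
> // packetToSend starts out as a full snapshot and is only downgraded
> // to a DIFF if one of the checks below matches.
> int packetToSend = Leader.SNAP;            // default: ship the whole snapshot
> long zxidToSend = 0;
> if (peerLastZxid == leaderLastZxid) {      // 0x400000004 == 0x500000000 -> false
>     packetToSend = Leader.DIFF;            // empty diff, never taken here
>     zxidToSend = leaderLastZxid;
> } else if (committedLogHasTxnsAfter(peerLastZxid)) {  // hypothetical helper
>     packetToSend = Leader.DIFF;            // diff from the in-memory log
> }
> // For a fully caught-up follower neither branch fires, so the leader
> // falls through and sends the entire snapshot.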
>
> A possible fix would be to keep a *lastProcessedZxid* in the leader that is
> updated only when the leader processes a transaction. While syncing with a
> follower, if the peerLastZxid sent by the follower is the same as the
> leader's lastProcessedZxid, we can send an empty diff to the follower, as
> sketched below.
> This would avoid unnecessarily sending the entire snapshot when the leader
> and follower are already in sync.
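>
> A minimal sketch of that idea (the field and wiring are hypothetical, not a
> patch against the actual classes):
>
> // In Leader: track the last zxid actually processed, separately from
> // lastProposed, which is bumped as soon as the new epoch is set.
> volatile long lastProcessedZxid;           // hypothetical field
>
> // In LearnerHandler, when deciding how to sync a joining follower:
> if (peerLastZxid == leader.lastProcessedZxid) {
>     // The follower already has everything the leader has processed,
>     // so an empty DIFF is enough and no snapshot transfer is needed.
>     packetToSend = Leader.DIFF;
>     zxidToSend = peerLastZxid;
> }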
>
> ZooKeeper developers, please share your views on the above issue.
>
> - Vinayak
>
> On Mon, May 14, 2012 at 8:30 AM, Camille Fournier <[EMAIL PROTECTED]>wrote:
>
>> Thanks.
>> I just ran a couple of tests to start the debugging. Mark, I don't see
>> a long cluster settle with a mostly empty data set, so I think this
>> might be two different problems. I do see a lot of snapshots being
>> sent, though, so the logic that decides when to send a snapshot is
>> probably overaggressive and should be looked at.
>> Adding the dev mailing list, as I may need ben or flavio to take a
>> look as well.
>>
>> C
>>
>> On Thu, May 10, 2012 at 10:48 AM,  <[EMAIL PROTECTED]> wrote:
>> > Cheers - Raised https://issues.apache.org/jira/browse/ZOOKEEPER-1465
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Camille Fournier [mailto:[EMAIL PROTECTED]]
>> > Sent: 10 May 2012 14:58
>> > To: [EMAIL PROTECTED]
>> > Subject: Re: Possible issue with cluster availability following new Leader Election - ZK 3.4
>> >
>> > I will take a look at this soon. Have you created a Jira for it? If not,
>> > please do so.
>> >
>> > Thanks,
>> > C
>> >
>> > On Thu, May 10, 2012 at 7:20 AM,  <[EMAIL PROTECTED]> wrote:
>> >> I think there may be a problem here with the 3.4 branch. I dropped the
>> >> cluster back to 3.3.5 and the behaviour was much better.
>> >>
>> >> To summarize:
>> >>
>> >> 650 MB of data
>> >> 20k nodes of varied size
>> >> 3-node cluster
>> >>
>> >> On 3.4.x (using latest branch build)
>> >> ---------
>> >> Takes 3-4 minutes to bring up a cluster from cold
>> >> Takes 40-50 secs to