Zookeeper >> mail # dev >> Re: Possible issue with cluster availability following new Leader Election - ZK 3.4


Re: Possible issue with cluster availability following new Leader Election - ZK 3.4
This pretty much matches what I expect. It would be great if you
wanted to try your hand at creating a patch and submitting it to the
ticket that was created for this problem, but if not, please post this
analysis to issue 1465 and we'll look at it ASAP.

C

On Wed, May 16, 2012 at 2:55 PM, Vinayak Khot <[EMAIL PROTECTED]> wrote:
> We have also encountered a problem where the newly elected leader sends an
> entire snapshot to a follower even though the follower is in sync with the
> leader.
>
> A closer look at the code shows a problem in the logic that decides whether
> to send a snapshot.
> The following scenario explains the problem in detail.
> Start a 3-node ZooKeeper ensemble in which every quorum member has seen the
> same changes.
> zxid: *0x400000004*
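
To make the zxid values in this scenario easier to read, here is a minimal sketch of the zxid layout, in which the high 32 bits carry the epoch and the low 32 bits carry a per-epoch counter. The class name ZxidLayoutSketch and its helper names are hypothetical and chosen for illustration; this is not ZooKeeper's own ZxidUtils source.

```java
// Hypothetical helper illustrating the zxid layout; not ZooKeeper's own class.
final class ZxidLayoutSketch {
    // High 32 bits hold the epoch, low 32 bits hold the per-epoch counter.
    static long makeZxid(long epoch, long counter) {
        return (epoch << 32) | (counter & 0xffffffffL);
    }

    static long epochOf(long zxid) {
        return zxid >> 32;
    }

    static long counterOf(long zxid) {
        return zxid & 0xffffffffL;
    }

    public static void main(String[] args) {
        long zxid = 0x400000004L; // the zxid every member has seen in the scenario
        System.out.printf("epoch=%d counter=%d%n", epochOf(zxid), counterOf(zxid));
        // A new leader in epoch 5 starts its counter at 0.
        System.out.printf("0x%x%n", makeZxid(5, 0));
    }
}
```

Read this way, 0x400000004 is transaction 4 of epoch 4, and 0x500000000 is the start of epoch 5 with no transactions in it yet.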
>
> 1. When a newly elected leader starts, it bumps its zxid up to the new
>    epoch.
>
> Code snippet from Leader.java
>
> long epoch = getEpochToPropose(self.getId(), self.getAcceptedEpoch());
> zk.setZxid(ZxidUtils.makeZxid(epoch, 0));
> synchronized(this){
>     lastProposed = zk.getZxid();  // now 0x500000000
> }
>
> 2. Now a follower tries to join the leader with its
>    peerLastZxid = *0x400000004*
>
> Note that the leader now has an in-memory committedLog list with
> maxCommittedLog = *0x400000004*
>
> As committedLog doesn't contain any transactions with zxid > peerLastZxid,
> we check whether the leader and follower are in sync.
>
> Code snippet from LearnerHandler.java
> leaderLastZxid = leader.startForwarding(this, updates);
> if (peerLastZxid == leaderLastZxid) {   // 0x400000004 == 0x500000000 is false
>   // We are in sync so we'll do an empty diff
>   packetToSend = Leader.DIFF;
>   zxidToSend = leaderLastZxid;
> }
>
> Note that the function *leader.startForwarding()* returns the *lastProposed*
> zxid, which the leader has already set to *0x500000000*.
> So in this scenario we never send an empty diff even though the leader and
> follower are in sync, and we end up sending an entire snapshot in the code
> that follows the above check.
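
The comparison described in this scenario can be reproduced in isolation. The sketch below is a simplified stand-in for the quoted check, assuming only two outcomes (empty DIFF on an exact match, SNAP otherwise); the real LearnerHandler has further branches that are not modeled here. SyncCheckSketch, SyncPacket, and choosePacket are hypothetical names.

```java
// Simplified, hypothetical model of the in-sync check; not the LearnerHandler source.
final class SyncCheckSketch {
    enum SyncPacket { DIFF, SNAP }

    // An empty DIFF is chosen only on an exact zxid match; anything else
    // falls through to a full snapshot in this reduced model.
    static SyncPacket choosePacket(long peerLastZxid, long leaderLastZxid) {
        return peerLastZxid == leaderLastZxid ? SyncPacket.DIFF : SyncPacket.SNAP;
    }

    public static void main(String[] args) {
        long peerLastZxid = 0x400000004L; // last zxid the follower has seen
        long lastProposed = 0x500000000L; // leader's zxid after the epoch bump
        // The follower holds every transaction, yet the values differ.
        System.out.println(choosePacket(peerLastZxid, lastProposed)); // prints SNAP
    }
}
```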
>
> A possible fix would be to keep a *lastProcessedZxid* in the leader that is
> updated only when the leader processes a transaction. While syncing with a
> follower, if the peerLastZxid sent by the follower is the same as the
> leader's lastProcessedZxid, we can send an empty diff to the follower.
> This would avoid unnecessarily sending an entire snapshot when the leader and
> follower are already in sync.
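
One way the proposed fix might look, as a rough sketch rather than an actual patch: the leader tracks lastProcessedZxid separately from lastProposed, and the epoch bump touches only the latter. The class LeaderZxidSketch and all of its methods are hypothetical and exist only to illustrate the idea.

```java
// Hypothetical model of the proposed fix; not a patch against Leader.java.
final class LeaderZxidSketch {
    private long lastProposed;
    private long lastProcessedZxid;

    LeaderZxidSketch(long lastSeenZxid) {
        this.lastProposed = lastSeenZxid;
        this.lastProcessedZxid = lastSeenZxid;
    }

    // The epoch bump moves lastProposed but leaves lastProcessedZxid alone.
    void startNewEpoch(long newEpoch) {
        lastProposed = newEpoch << 32;
    }

    // Only a processed transaction advances lastProcessedZxid.
    void processTransaction(long zxid) {
        lastProposed = Math.max(lastProposed, zxid);
        lastProcessedZxid = zxid;
    }

    long getLastProposed() {
        return lastProposed;
    }

    // An empty diff is safe when the follower has seen everything processed.
    boolean canSendEmptyDiff(long peerLastZxid) {
        return peerLastZxid == lastProcessedZxid;
    }

    public static void main(String[] args) {
        LeaderZxidSketch leader = new LeaderZxidSketch(0x400000004L);
        leader.startNewEpoch(5);
        System.out.println(leader.canSendEmptyDiff(0x400000004L)); // prints true
        leader.processTransaction(0x500000001L);
        System.out.println(leader.canSendEmptyDiff(0x400000004L)); // prints false
    }
}
```

Once the leader processes a transaction in the new epoch, the follower is genuinely behind, so the empty diff no longer applies.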
>
> ZooKeeper developers, please share your views on the above-mentioned issue.
>
> - Vinayak
>
> On Mon, May 14, 2012 at 8:30 AM, Camille Fournier <[EMAIL PROTECTED]> wrote:
>
>> Thanks.
>> I just ran a couple of tests to start the debugging. Mark, I don't see
>> a long cluster settle with a mostly empty data set, so I think this
>> might be two different problems. I do see a lot of snapshots being
>> sent though, so we are probably overaggressive in the way we decide
>> when to send snapshots, and that should be examined.
>> Adding the dev mailing list, as I may need ben or flavio to take a
>> look as well.
>>
>> C
>>
>> On Thu, May 10, 2012 at 10:48 AM,  <[EMAIL PROTECTED]> wrote:
>> > Cheers - Raised https://issues.apache.org/jira/browse/ZOOKEEPER-1465
>> >
>> >
>> >
>> > -----Original Message-----
>> > From: Camille Fournier [mailto:[EMAIL PROTECTED]]
>> > Sent: 10 May 2012 14:58
>> > To: [EMAIL PROTECTED]
>> > Subject: Re: Possible issue with cluster availability following new Leader Election - ZK 3.4
>> >
>> > I will take a look at this soon. Have you created a Jira for it? If not,
>> > please do so.
>> >
>> > Thanks,
>> > C
>> >
>> > On Thu, May 10, 2012 at 7:20 AM,  <[EMAIL PROTECTED]> wrote:
>> >> I think there may be a problem here with the 3.4 branch. I dropped the
>> >> cluster back to 3.3.5 and the behaviour was much better.
>> >>
>> >> To summarize:
>> >>
>> >> 650 MB of data
>> >> 20k nodes of varied size
>> >> 3 node cluster
>> >>
>> >> On 3.4.x (using latest branch build)
>> >> ---------
>> >> Takes 3-4 minutes to bring up a cluster from cold
>> >> Takes 40-50 secs to