Zookeeper, mail # user - leader election, scheduled tasks, losing leadership


Re: leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 22:10
Also sometimes the app leadership (via LeaderLatch) will get lost - I will
follow up about this on the Curator list:
https://gist.github.com/4247226

So back to my previous question, what is the best way to implement the
"fence"?

-- Eric
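
For illustration, one common way to build such a fence is to put the check in front of the non-idempotent action itself rather than in the task that calls it. The sketch below is not from this thread and is not a Curator or ZooKeeper API. The class FencedMailer, its method names, and the "epoch" parameter are all hypothetical. It assumes each newly elected leader can obtain a number that is larger than the one any previous leader held (how that number is obtained is outside the sketch), and it combines that check with a per-recipient record so the same report is not sent twice.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical guard in front of a non-idempotent action (here, emailing a report).
 * Illustrative only: not part of Curator or ZooKeeper.
 */
final class FencedMailer {
    // Highest leader epoch that has been allowed to act so far.
    private long highestEpochSeen = Long.MIN_VALUE;
    // Keys of (report, recipient) pairs that have already been handled.
    private final Set<String> delivered = new HashSet<>();

    /**
     * @param epoch assumed to grow with every new leader (an assumption of this sketch)
     * @return true only if this call is the one that sends the mail
     */
    synchronized boolean send(long epoch, String reportId, String recipient) {
        if (epoch < highestEpochSeen) {
            return false; // a newer leader has already acted, so fence the stale one
        }
        highestEpochSeen = epoch;
        // Set.add returns false when the key is already present.
        if (!delivered.add(reportId + "|" + recipient)) {
            return false; // this recipient already got this report
        }
        // The real email call would go here.
        return true;
    }

    synchronized int deliveredCount() {
        return delivered.size();
    }
}
```

In this sketch a caller that still believes it is the leader, but holds an older epoch, gets false back and sends nothing. The state is kept in memory only to keep the example short. In a real cluster it would have to live somewhere all members share, such as a database row updated in the same transaction as the send record.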

On Sun, Dec 9, 2012 at 4:42 PM, Eric Pederson <[EMAIL PROTECTED]> wrote:

> The irony is that I am using leader election to convert non-idempotent
> operations into idempotent operations :)   For example, once a night a
> report is emailed out to a set of addresses.   We don't want the report to
> go to the same person more than once.
>
> Prior to using leader election one of the cluster members was designated
> as the scheduled task "leader" using a system property.  But if that
> cluster member crashed it required a manual operation to fail over the
> "leader" responsibility to another cluster member.   I considered using
> app-specific techniques to make the scheduled tasks idempotent (for example
> using "select for update" / database locking) but I wanted a general
> solution and I needed clustering support for other reasons (cluster
> membership, etc.).
>
> Anyway, here is the code that I'm using.
>
> Application startup (using Curator LeaderLatch):
> https://gist.github.com/3936162
> https://gist.github.com/3935895
> https://gist.github.com/3935889
>
> ClusterStatus:
> https://gist.github.com/3943149
> https://gist.github.com/3935861
>
> Scheduled task:
> https://gist.github.com/4246388
>
> In the last gist the "distribute" scheduled task is run every 30 seconds.
> It checks clusterStatus.isLeader to see if the current cluster member is
> the leader before running the real method (which sends email).
> clusterStatus() calls methods on LeaderLatch.
>
> Here is the output that I am seeing if I kill the ZK quorum leader and the
> app cluster member that was the leader loses its LeaderLatch leadership to
> another cluster member:
> https://gist.github.com/4247058
>
>
> -- Eric
>
>
>
> On Sun, Dec 9, 2012 at 12:30 AM, Henry Robinson <[EMAIL PROTECTED]> wrote:
>
>> On 8 December 2012 21:18, Jordan Zimmerman <[EMAIL PROTECTED]> wrote:
>>
>> > If your ConnectionStateListener gets SUSPENDED or LOST you've lost
>> > connection to ZooKeeper. Therefore you cannot use that same ZooKeeper
>> > connection to manage a node that denotes whether the process is running.
>> > Only 1 VM at a time will be running the process. That process can watch
>> > for SUSPENDED/LOST and wind down the task.
>> >
>> >
>> My point is that by the time that VM sees SUSPENDED/LOST, another VM may
>> have been elected leader and have started running another process.
>>
>> It's a classic problem - you need some mechanism to fence a node that
>> thinks it's the leader, but isn't and hasn't got the memo yet. The way
>> around the problem is to ensure either that no work is done by you once
>> you are no longer the leader (perhaps by checking every time you want to
>> do work), or that the work you do does not affect the system (e.g. by
>> idempotent work units).
>>
>> ZK itself solves this internally by checking that it has a quorum for
>> every operation, which forces an ordering between the disconnection event
>> and trying to do something that relies upon being the leader. Other
>> systems forcibly terminate old leaders before allowing a new leader to
>> take the throne.
>>
>> Henry
>>
>>
>> > > You can't assume that the notification is received locally before
>> > > another leader election finishes elsewhere
>> > Which notification? The ConnectionStateListener is an abstraction on
>> > ZooKeeper's watcher mechanism. It's only significant for the VM that is
>> > the leader. Non-leaders don't need to be concerned.
>>
>>
>> > -JZ
>> >
>> > On Dec 8, 2012, at 9:12 PM, Henry Robinson <[EMAIL PROTECTED]> wrote:
>> >
>> > > You can't assume that the notification is received locally before
>> > > another leader election finishes elsewhere (particularly if you are
>> > > running slowly
>> > > for some reason!), so it's not sufficient to guarantee that the