|
Eric Pederson
2012-12-09, 04:17
Jordan Zimmerman
2012-12-09, 04:25
Eric Pederson
2012-12-09, 04:49
Jordan Zimmerman
2012-12-09, 04:52
Eric Pederson
2012-12-09, 04:54
Eric Pederson
2012-12-09, 04:56
Jordan Zimmerman
2012-12-09, 04:57
Henry Robinson
2012-12-09, 04:59
Jordan Zimmerman
2012-12-09, 05:00
Eric Pederson
2012-12-09, 05:00
Henry Robinson
2012-12-09, 05:02
Jordan Zimmerman
2012-12-09, 05:04
Henry Robinson
2012-12-09, 05:12
Jordan Zimmerman
2012-12-09, 05:18
Henry Robinson
2012-12-09, 05:30
Jordan Zimmerman
2012-12-09, 05:41
Eric Pederson
2012-12-09, 21:42
Eric Pederson
2012-12-09, 22:10
Vitalii Tymchyshyn
2012-12-10, 06:49
Eric Pederson
2012-12-10, 11:52
Vitalii Tymchyshyn
2012-12-11, 20:09
Eric Pederson
2012-12-12, 00:54
|
-
leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 04:17
Hi all:

In my system I have scheduled tasks that only one cluster member should run. I am using the leader election recipe to determine which cluster member should run the scheduled tasks. The way it works is that every cluster member has the scheduler running. At the time a scheduled job starts, all cluster members execute the same method. It first checks if the current node is the leader. If it is, it goes ahead and runs the task. Otherwise the method returns.

The tasks themselves can take a few milliseconds up to tens of minutes. During the time the task is running a cluster member could lose its leadership. I don't want another cluster member to start running a scheduled leader-only task until the first one is finished.

At first I considered using an ephemeral node as a flag to indicate "task in progress" and changing the logic for starting a scheduled task to be "if I am the leader AND no task is currently in progress". However, if the znode is ephemeral it could get lost the same way the leadership was lost. On the other hand, if I use a non-ephemeral node I need to add logic to check for stale/invalid "task in progress" nodes (check for staleness plus try to contact the node that is running the task to see if it responds).

Am I correct in assuming that I cannot use an ephemeral node for the "task in progress" flag? And that a non-ephemeral node with stale checking is the way to go? This seems like a pretty common use case.

Thanks,
-- Eric
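The scheduling flow described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: `LeaderGatedTask` is a hypothetical name, and the `BooleanSupplier` is an assumed stand-in for a leadership check such as Curator's `LeaderLatch.hasLeadership()`.

```java
import java.util.function.BooleanSupplier;

/** Runs the wrapped task only when the leadership check reports true. */
final class LeaderGatedTask implements Runnable {
    // Assumption: stands in for a real check such as LeaderLatch.hasLeadership().
    private final BooleanSupplier isLeader;
    private final Runnable task;

    LeaderGatedTask(BooleanSupplier isLeader, Runnable task) {
        this.isLeader = isLeader;
        this.task = task;
    }

    @Override
    public void run() {
        if (!isLeader.getAsBoolean()) {
            return; // not the leader: skip this scheduled firing
        }
        // Leadership is checked once, before the task starts.
        // Nothing here notices if leadership is lost while the task runs.
        task.run();
    }
}
```

Every cluster member would hand an instance of this to its scheduler. The single up-front check is where the concern in the question comes from: a task that runs for minutes can outlive the leadership it started with.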
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 04:25
Why would it lose leadership? The only reason I can think of is if the ZK cluster goes down. In normal use, the ZK cluster won't go down (I assume you're running 3 or 5 instances).

-JZ

On Dec 8, 2012, at 8:17 PM, Eric Pederson <[EMAIL PROTECTED]> wrote:
> During the time the task is running a cluster member could lose its
> leadership.
-
Re: leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 04:49
Actually I had the same thought and didn't consider having to do this until I talked about my project at a Zookeeper User Group a month or so ago and I was given this advice.

I know that I do see leadership being lost/transferred when one of the ZK servers is restarted (not the whole ensemble). And it seems like I've seen it happen even when the ensemble stays totally stable (though I am not 100% sure as it's been a while since I have worked on this particular application).

-- Eric
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 04:52
You might check your leader implementation. Writing a correct leader recipe is actually quite challenging due to edge cases. Have a look at Curator (disclosure: I wrote it) for an example.

-JZ
-
Re: leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 04:54
I am using Curator LeaderLatch :)

-- Eric
-
Re: leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 04:56
If I recall correctly it was Henry Robinson that gave me the advice to have a "task in progress" check.

-- Eric
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 04:57
lol - I have an open bug for LeaderLatch. hasLeadership() is returning true when the cluster becomes unstable: https://github.com/Netflix/curator/issues/216

In any event, you should register a ConnectionStateListener in case the cluster loses quorum.

-JZ
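One way a long-running task might react to such a listener is sketched below. It is only an illustration under stated assumptions: `ConnState` is a local stand-in that mirrors a subset of Curator's `ConnectionState` values, and `WindDownTask` and `onStateChanged` are hypothetical names. In a real application the callback would be invoked from a registered `ConnectionStateListener`.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Local stand-in mirroring a subset of Curator's ConnectionState values. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** A task split into work units that stops once leadership is in doubt. */
final class WindDownTask {
    private final AtomicBoolean leadershipInDoubt = new AtomicBoolean(false);

    /** Hypothetical hook: call this from the connection-state callback. */
    void onStateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            // Treat any interruption as "assume I am no longer the leader".
            leadershipInDoubt.set(true);
        }
    }

    /** Runs up to totalUnits work units and returns how many completed. */
    int run(int totalUnits, Runnable unit) {
        int unitsDone = 0;
        for (int i = 0; i < totalUnits; i++) {
            if (leadershipInDoubt.get()) {
                break; // wind down between units instead of finishing the job
            }
            unit.run();
            unitsDone++;
        }
        return unitsDone;
    }
}
```

The flag is only consulted between work units, so the task can still be mid-unit when the notification arrives. This narrows the window in which a stale leader keeps working, but it does not close it.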
-
Re: leader election, scheduled tasks, losing leadership
Henry Robinson 2012-12-09, 04:59
Eric -

Might be helpful if you post the code that you had as one of your slides so that we can see exactly what you're doing.

Henry

--
Henry Robinson
Software Engineer
Cloudera
415-994-6679
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 05:00
The leader latch lock is the equivalent of task in progress. I assume the task is running in the same VM as the leader lock. The only reason the VM would lose leadership is if it crashes, in which case the process would die anyway.

-JZ
-
Re: leader election, scheduled tasks, losing leadership
Eric Pederson 2012-12-09, 05:00
Ok - I'll do that tomorrow. Thanks guys.

-- Eric
-
Re: leader election, scheduled tasks, losing leadership
Henry Robinson 2012-12-09, 05:02
What about a network disconnection? Presumably leadership is revoked when the leader appears to have failed, which can be for more reasons than a VM crash (VM running slow, network event, GC pause etc).

Henry
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 05:04
This is why you need a ConnectionStateListener. You'll get a notice that the connection has been suspended and you should assume all locks/leaders are invalid.

-JZ
-
Re: leader election, scheduled tasks, losing leadership
Henry Robinson 2012-12-09, 05:12
You can't assume that the notification is received locally before another leader election finishes elsewhere (particularly if you are running slowly for some reason!), so it's not sufficient to guarantee that the process that is running locally has finished before someone else starts another.

It's usually best - if possible - to restructure the system so that processes are idempotent to work around these kinds of problems, in conjunction with using the kind of primitives that Curator provides.

Henry
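As a rough illustration of what "idempotent" could mean for a scheduled job, the sketch below records each result under a key derived from the job name and its scheduled time, so a second run of the same slot changes nothing. `IdempotentJobStore` is a hypothetical name, and the in-memory map is an assumed stand-in for whatever shared store (a database, for instance) the real system writes to.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Stores one result per (job, scheduled time) slot; repeat runs are no-ops. */
final class IdempotentJobStore {
    // Assumption: stands in for a shared store visible to all cluster members.
    private final Map<String, String> results = new ConcurrentHashMap<>();

    /** Returns true if this call recorded the result, false if the slot was already done. */
    boolean complete(String jobName, long scheduledEpochSecond, String result) {
        // The key depends only on the schedule, never on which node ran the job.
        String key = jobName + "@" + scheduledEpochSecond;
        return results.putIfAbsent(key, result) == null;
    }

    int size() {
        return results.size();
    }
}
```

With this shape, it stops mattering whether an old leader and a new leader both run the same scheduled slot: the second `complete` call for that slot is rejected and the stored state is unchanged.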
-
Re: leader election, scheduled tasks, losing leadership
Jordan Zimmerman 2012-12-09, 05:18
If your ConnectionStateListener gets SUSPENDED or LOST you've lost connection to ZooKeeper. Therefore you cannot use that same ZooKeeper connection to manage a node that denotes the process is running or not. Only 1 VM at a time will be running the process. That process can watch for SUSPENDED/LOST and wind down the task.

> You can't assume that the notification is received locally before another
> leader election finishes elsewhere

Which notification? The ConnectionStateListener is an abstraction on ZooKeeper's watcher mechanism. It's only significant for the VM that is the leader. Non-leaders don't need to be concerned.

-JZ
-
Re: leader election, scheduled tasks, losing leadershipHenry Robinson 2012-12-09, 05:30
On 8 December 2012 21:18, Jordan Zimmerman <[EMAIL PROTECTED]> wrote:

> If your ConnectionStateListener gets SUSPENDED or LOST you've lost
> connection to ZooKeeper. Therefore you cannot use that same ZooKeeper
> connection to manage a node that denotes the process is running or not.
> Only 1 VM at a time will be running the process. That process can watch for
> SUSPENDED/LOST and wind down the task.

My point is that by the time that VM sees SUSPENDED/LOST, another VM may have been elected leader and have started running another process.

It's a classic problem: you need some mechanism to fence a node that thinks it's the leader, but isn't and hasn't got the memo yet. The way around the problem is either to ensure that you do no work once you are no longer the leader (perhaps by checking every time you want to do work), or to ensure that the work you do does not affect the system (e.g. by using idempotent work units).

ZK itself solves this internally by checking that it has a quorum for every operation, which forces an ordering between the disconnection event and trying to do something that relies upon being the leader. Other systems forcibly terminate old leaders before allowing a new leader to take the throne.

Henry

Henry Robinson
Software Engineer
Cloudera
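One way to picture the "fence" described above is to have the shared resource, rather than the leader, reject stale work. The following is only a minimal sketch under stated assumptions: `FencedStore` and the `leaderEpoch` number are hypothetical names invented for illustration, and the sketch assumes each newly elected leader somehow obtains a number larger than its predecessor's. How that number is obtained is not shown, and nothing here is Curator or ZooKeeper API.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical resource that rejects writes from stale leaders.
 * All names are illustrative; none of this is Curator or ZooKeeper API.
 */
final class FencedStore {
    private long highestEpochSeen = -1;
    private final List<String> applied = new ArrayList<>();

    /** Applies the write only if the caller's epoch is not older than the newest one seen. */
    synchronized boolean write(long leaderEpoch, String payload) {
        if (leaderEpoch < highestEpochSeen) {
            return false; // a newer leader has already written, so the old one is fenced off
        }
        highestEpochSeen = leaderEpoch;
        applied.add(payload);
        return true;
    }

    /** Returns a copy of the payloads that were accepted, in order. */
    synchronized List<String> applied() {
        return new ArrayList<>(applied);
    }
}

final class FencingDemo {
    public static void main(String[] args) {
        FencedStore store = new FencedStore();
        System.out.println(store.write(1, "write from old leader"));      // accepted
        System.out.println(store.write(2, "write from new leader"));      // accepted
        System.out.println(store.write(1, "late write from old leader")); // rejected
    }
}
```

The design choice is that the old leader does not need to learn in time that it was deposed: its late write carries an older epoch and is refused at the point where it would take effect. This only helps when the resource being modified can perform such a check.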
-
Re: leader election, scheduled tasks, losing leadershipJordan Zimmerman 2012-12-09, 05:41
> My point is that by the time that VM sees SUSPENDED/LOST, another VM may
> have been elected leader and have started running another process.

There's no way around this, right? ZK is not a transactional system, so this edge case is unsolvable.

> The way around the problem is to either ensure that no work is done by you
> once you are no longer the leader

You only release leadership when your work is done. If the cluster becomes unstable, you cancel your work. Leadership is denoted by a ZNode, and Curator has a top-level watcher that notifies on cluster instability. How does the fence make this better?

-JZ
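The "cancel your work when the cluster becomes unstable" approach can be sketched as a task that is split into small units and checks a flag before each one. This is a hedged illustration, not Curator code: `ConnState` is a hypothetical stand-in for the connection states named in the thread, and the wiring from a real connection-state callback to `onStateChanged` is assumed rather than shown.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical stand-in for the connection states mentioned in the thread. */
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

/** A long-running task, split into units, that stops once leadership is in doubt. */
final class CancellableTask {
    private final AtomicBoolean leadershipInDoubt = new AtomicBoolean(false);

    /** Meant to be called from a connection-state callback; that wiring is not shown. */
    void onStateChanged(ConnState newState) {
        if (newState == ConnState.SUSPENDED || newState == ConnState.LOST) {
            leadershipInDoubt.set(true); // never reset here: a new election decides who runs next
        }
    }

    /** Runs up to totalUnits units, checking the flag before each one. Returns the units completed. */
    int run(int totalUnits, Runnable unitOfWork) {
        int unitsDone = 0;
        for (int i = 0; i < totalUnits; i++) {
            if (leadershipInDoubt.get()) {
                break; // wind down instead of continuing as a possibly stale leader
            }
            unitOfWork.run();
            unitsDone++;
        }
        return unitsDone;
    }
}
```

Note that this narrows the window discussed in the thread but does not close it: the unit that is already running when the notification arrives still completes, and the notification itself may arrive after another member has been elected.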
-
Re: leader election, scheduled tasks, losing leadershipEric Pederson 2012-12-09, 21:42
The irony is that I am using leader election to convert non-idempotent operations into idempotent operations :) For example, once a night a report is emailed out to a set of addresses. We don't want the report to go to the same person more than once.

Prior to using leader election, one of the cluster members was designated as the scheduled task "leader" using a system property. But if that cluster member crashed, it required a manual operation to fail over the "leader" responsibility to another cluster member. I considered using app-specific techniques to make the scheduled tasks idempotent (for example using "select for update" / database locking), but I wanted a general solution and I needed clustering support for other reasons (cluster membership, etc.).

Anyway, here is the code that I'm using.

Application startup (using Curator LeaderLatch):
https://gist.github.com/3936162
https://gist.github.com/3935895
https://gist.github.com/3935889

ClusterStatus:
https://gist.github.com/3943149
https://gist.github.com/3935861

Scheduled task:
https://gist.github.com/4246388

In the last gist the "distribute" scheduled task is run every 30 seconds. It checks clusterStatus.isLeader to see if the current cluster member is the leader before running the real method (which sends email). clusterStatus() calls methods on LeaderLatch.

Here is the output that I am seeing if I kill the ZK quorum leader and the app cluster member that was the leader loses its LeaderLatch leadership to another cluster member:
https://gist.github.com/4247058

-- Eric
-
Re: leader election, scheduled tasks, losing leadershipEric Pederson 2012-12-09, 22:10
Also, sometimes the app leadership (via LeaderLatch) will get lost. I will follow up about this on the Curator list:
https://gist.github.com/4247226

So, back to my previous question: what is the best way to implement the "fence"?

-- Eric
-
Re: leader election, scheduled tasks, losing leadershipVitalii Tymchyshyn 2012-12-10, 06:49
How are you going to ensure atomicity? I mean, if your processor dies in the middle of the operation, how do you know if it is done or not?

--
Best regards,
Vitalii Tymchyshyn
-
Re: leader election, scheduled tasks, losing leadershipEric Pederson 2012-12-10, 11:52
It depends on the scheduled task. Some have status fields in the database that indicate new/in-progress/done, but others do not.

-- Eric
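For the tasks that do have a new/in-progress/done status field, that field can double as a claim: a member runs the task only if it wins an atomic transition from new to in-progress. The sketch below is an assumption-laden illustration, not the code from the gists. `TaskStatusTable` is a hypothetical in-memory stand-in for the database column; in a real database the same idea would presumably be a conditional update that succeeds for one caller only.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

enum TaskStatus { NEW, IN_PROGRESS, DONE }

/** In-memory stand-in for a status column (hypothetical; not from the original gists). */
final class TaskStatusTable {
    private final ConcurrentMap<String, TaskStatus> rows = new ConcurrentHashMap<>();

    /** Adds the task in state NEW unless it is already known. */
    void register(String taskId) {
        rows.putIfAbsent(taskId, TaskStatus.NEW);
    }

    /** Atomically moves NEW to IN_PROGRESS; succeeds for at most one caller per task. */
    boolean tryClaim(String taskId) {
        return rows.replace(taskId, TaskStatus.NEW, TaskStatus.IN_PROGRESS);
    }

    /** Atomically moves IN_PROGRESS to DONE. */
    boolean markDone(String taskId) {
        return rows.replace(taskId, TaskStatus.IN_PROGRESS, TaskStatus.DONE);
    }

    /** Returns the current status, or null for an unknown task. */
    TaskStatus status(String taskId) {
        return rows.get(taskId);
    }
}
```

A claim like this prevents two members from starting the same task, but it does not answer the atomicity question raised earlier in the thread: if the claimant dies while the row says in-progress, something still has to decide whether the work finished.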
-
Re: leader election, scheduled tasks, losing leadershipVitalii Tymchyshyn 2012-12-11, 20:09
I am asking because you have this "at most once" vs. "at least once" problem. I don't think you can have "exactly once" unless your jobs are transactional and you can synchronize your transaction commits with ZooKeeper (preferably with two-phase commit). So, you need to decide which one you want.

What I'd recommend is a queue-like architecture rather than a lock-based one. This way you can:
a) Do parallel task processing
b) Try increasing timeouts to be larger than the maximum task time. E.g. set the timeout to one hour. This would mean that a running task will restart in an hour if the client fails.

But this would mean moving from the database to ZooKeeper for task status/queueing. In my opinion this would be good, as the database is a SPOF for you.

Best regards, Vitalii Tymchyshyn
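The queue-with-timeout idea can be sketched as a queue in which claiming a task grants a lease: if the task is not completed before the lease expires, it becomes claimable again. This is a toy, single-process illustration with hypothetical names (`LeasedTaskQueue`, `leaseMillis`); it is not a ZooKeeper or Curator recipe, and the clock is passed in explicitly to keep the behaviour easy to follow.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy single-process queue with lease timeouts; an illustration, not a ZooKeeper recipe. */
final class LeasedTaskQueue {
    private static final long UNCLAIMED = Long.MIN_VALUE;
    private final long leaseMillis;
    // taskId -> lease expiry time in millis; UNCLAIMED means nobody holds the task
    private final Map<String, Long> pending = new LinkedHashMap<>();

    LeasedTaskQueue(long leaseMillis) {
        this.leaseMillis = leaseMillis;
    }

    /** Adds a task unless it is already pending. */
    synchronized void submit(String taskId) {
        pending.putIfAbsent(taskId, UNCLAIMED);
    }

    /** Hands out the first task that is unclaimed or whose lease has expired. */
    synchronized Optional<String> claim(long nowMillis) {
        for (Map.Entry<String, Long> entry : pending.entrySet()) {
            if (entry.getValue() <= nowMillis) {
                entry.setValue(nowMillis + leaseMillis); // start a fresh lease for this claimant
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    /** Removes a finished task so that it is never handed out again. */
    synchronized void complete(String taskId) {
        pending.remove(taskId);
    }
}
```

With a lease of one hour, a task whose worker dies is handed out again an hour after it was claimed. This gives "at least once" behaviour: a worker that is merely slow, not dead, can still be running when the lease expires, so the task may run twice unless the work itself is idempotent.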
-
Re: leader election, scheduled tasks, losing leadershipEric Pederson 2012-12-12, 00:54
Thanks Vitalii! I will think about this and ask if I have any questions.

-- Eric