|
Zhao Boran
2013-01-11, 13:46
Andrey Stepachev
2013-01-11, 14:48
Hulunbier
2013-01-11, 16:10
Jordan Zimmerman
2013-01-11, 20:20
Hulunbier
2013-01-12, 10:30
Ben Bangert
2013-01-12, 17:39
Jordan Zimmerman
2013-01-13, 01:31
Hulunbier
2013-01-13, 15:05
Vitalii Tymchyshyn
2013-01-14, 10:37
Hulunbier
2013-01-14, 15:06
Vitalii Tymchyshyn
2013-01-14, 15:38
Ted Dunning
2013-01-14, 16:05
Hulunbier
2013-01-15, 02:28
Hulunbier
2013-01-15, 01:52
Jordan Zimmerman
2013-01-15, 02:23
Hulunbier
2013-01-15, 03:45
Benjamin Reed
2013-01-15, 05:27
Hulunbier
2013-01-15, 06:32
Ted Dunning
2013-01-17, 11:43
Hulunbier
2013-01-18, 08:26
Benjamin Reed
2013-01-17, 04:28
Hulunbier
2013-01-17, 09:05
Vitalii Tymchyshyn
2013-01-27, 19:29
Hulunbier
2013-01-13, 14:40
|
-
Getting confused with the "recipe for lock"Zhao Boran 2013-01-11, 13:46
While reading the zookeeper's recipe for
lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks>, I get confused: Seems that this recipe-for-distributed-lock can not guarantee *"any snapshot in time no two clients think they hold the same lock"*. But since zookeeper is so widely adopted, if there were such mistakes in the reference doc, someone should have pointed it out long time ago. So, what did I misunderstand? please help me! Recipe-for-distributed-lock (from http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks) Locks Fully distributed locks that are globally synchronous, *meaning at any snapshot in time no two clients think they hold the same lock*. These can be implemented using ZooKeeeper. As with priority queues, first define a lock node. 1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the sequence and ephemeral flags set. 2. Call getChildren( ) on the lock node without setting the watch flag (this is important to avoid the herd effect). 3. If the pathname created in step 1 has the lowest sequence number suffix, the client has the lock and the client exits the protocol. 4. The client calls exists( ) with the watch flag set on the path in the lock directory with the next lowest sequence number. 5. if exists( ) returns false, go to step 2. Otherwise, wait for a notification for the pathname from the previous step before going to step 2. Considering the following case: - Client1 successfully acquired the lock(in step3), with zk node "locknode/guid-lock-0"; - Client2 created node "locknode/guid-lock-1", failed to acquire the lock, and watching "locknode/guid-lock-0"; - Later, for some reasons(network congestion?), client1 failed to send heart beat message to zk cluster on time, but client1 is still perfectly working, and assuming itself still holding the lock. - But, Zookeeper may think client1's session is timeouted, and then 1. deletes "locknode/guid-lock-0" 2. sends a notification to Client2 (or send the notification first?) 3. but can not send "session timeout" notification to client1 in time (due to network congestion?) - Client2 got the notification, goes to step 2, gets the only node ""locknode/guid-lock-1", which is created by itself; thus, client2 assumes it hold the lock. - But at the same time, client1 assumes it hold the lock. Is this a valid scenario? Thanks a lot! +
Zhao Boran 2013-01-11, 13:46
-
Re: Getting confused with the "recipe for lock"Andrey Stepachev 2013-01-11, 14:48
Hi,
Yes, this scenario is very likely. But it will work only for long running tasks (more then session timeout), for short livinig tasks lock will be unlocked before session timeout, surely. In case of long living locks, Client1 should track disconnection from zk cluster and assume, that lock was abandoned (and somehow notify lock owner about that). Client can know value of session timeout and spawn timer, and action accordingly program logic. As example it can interrupt thread, which created lock, and rise some flag, so long running task can know - lock is not valid. On Fri, Jan 11, 2013 at 5:46 PM, Zhao Boran <[EMAIL PROTECTED]> wrote: > While reading the zookeeper's recipe for > lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks>, > I get confused: > > Seems that this recipe-for-distributed-lock can not guarantee *"any > snapshot in time no two clients think they hold the same lock"*. > > But since zookeeper is so widely adopted, if there were such mistakes in > the reference doc, someone should have pointed it out long time ago. > > So, what did I misunderstand? please help me! > > Recipe-for-distributed-lock (from > http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks) > > Locks > > Fully distributed locks that are globally synchronous, *meaning at any > snapshot in time no two clients think they hold the same lock*. These can > be implemented using ZooKeeeper. As with priority queues, first define a > lock node. > > 1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the > sequence and ephemeral flags set. > 2. Call getChildren( ) on the lock node without setting the watch flag > (this is important to avoid the herd effect). > 3. If the pathname created in step 1 has the lowest sequence number > suffix, the client has the lock and the client exits the protocol. > 4. The client calls exists( ) with the watch flag set on the path in the > lock directory with the next lowest sequence number. > 5. if exists( ) returns false, go to step 2. Otherwise, wait for a > notification for the pathname from the previous step before going to > step 2. > > Considering the following case: > > - > > Client1 successfully acquired the lock(in step3), with zk node > "locknode/guid-lock-0"; > - > > Client2 created node "locknode/guid-lock-1", failed to acquire the lock, > and watching "locknode/guid-lock-0"; > - > > Later, for some reasons(network congestion?), client1 failed to send > heart beat message to zk cluster on time, but client1 is still perfectly > working, and assuming itself still holding the lock. > - > > But, Zookeeper may think client1's session is timeouted, and then > 1. deletes "locknode/guid-lock-0" > 2. sends a notification to Client2 (or send the notification first?) > 3. but can not send "session timeout" notification to client1 in time > (due to network congestion?) > > > - > > Client2 got the notification, goes to step 2, gets the only node > ""locknode/guid-lock-1", which is created by itself; thus, client2 > assumes > it hold the lock. > - > > But at the same time, client1 assumes it hold the lock. > > Is this a valid scenario? > > Thanks a lot! > -- Andrey. +
Andrey Stepachev 2013-01-11, 14:48
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-11, 16:10
Thanks!
I do agree with you that Client1 will eventually know that the lock is invalid, by tracking disconnection and time. But, 1. Time can not by precisely synchronized between servers; it is likely that client1 will detect session timeout (by its timer thread), after server treats client1's session as timeouted and Client2 thinks itself holding the lock. so, within a small time gap, more than one client may believe themselves holding the lock. 2. thus , the protocol of lock can still not guarantee exclusiveness; is it ... er... broken ? On Fri, Jan 11, 2013 at 10:48 PM, Andrey Stepachev <[EMAIL PROTECTED]> wrote: > Hi, > > Yes, this scenario is very likely. > But it will work only for long running tasks (more then session timeout), > for short livinig tasks lock will be unlocked before session timeout, > surely. > > In case of long living locks, Client1 should track disconnection from zk > cluster and assume, that lock was abandoned (and somehow notify lock owner > about that). Client can know value of session timeout and spawn timer, and > action accordingly program logic. As example it can interrupt thread, which > created lock, and rise some flag, so long running task can know - lock is > not valid. > > > On Fri, Jan 11, 2013 at 5:46 PM, Zhao Boran <[EMAIL PROTECTED]> wrote: > > > While reading the zookeeper's recipe for > > lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks > >, > > I get confused: > > > > Seems that this recipe-for-distributed-lock can not guarantee *"any > > snapshot in time no two clients think they hold the same lock"*. > > > > But since zookeeper is so widely adopted, if there were such mistakes in > > the reference doc, someone should have pointed it out long time ago. > > > > So, what did I misunderstand? please help me! > > > > Recipe-for-distributed-lock (from > > http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks) > > > > Locks > > > > Fully distributed locks that are globally synchronous, *meaning at any > > snapshot in time no two clients think they hold the same lock*. These can > > be implemented using ZooKeeeper. As with priority queues, first define a > > lock node. > > > > 1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the > > sequence and ephemeral flags set. > > 2. Call getChildren( ) on the lock node without setting the watch flag > > (this is important to avoid the herd effect). > > 3. If the pathname created in step 1 has the lowest sequence number > > suffix, the client has the lock and the client exits the protocol. > > 4. The client calls exists( ) with the watch flag set on the path in > the > > lock directory with the next lowest sequence number. > > 5. if exists( ) returns false, go to step 2. Otherwise, wait for a > > notification for the pathname from the previous step before going to > > step 2. > > > > Considering the following case: > > > > - > > > > Client1 successfully acquired the lock(in step3), with zk node > > "locknode/guid-lock-0"; > > - > > > > Client2 created node "locknode/guid-lock-1", failed to acquire the > lock, > > and watching "locknode/guid-lock-0"; > > - > > > > Later, for some reasons(network congestion?), client1 failed to send > > heart beat message to zk cluster on time, but client1 is still > perfectly > > working, and assuming itself still holding the lock. > > - > > > > But, Zookeeper may think client1's session is timeouted, and then > > 1. deletes "locknode/guid-lock-0" > > 2. sends a notification to Client2 (or send the notification > first?) > > 3. but can not send "session timeout" notification to client1 in > time > > (due to network congestion?) > > > > > > - > > > > Client2 got the notification, goes to step 2, gets the only node > > ""locknode/guid-lock-1", which is created by itself; thus, client2 > > assumes > > it hold the lock. > > - > > > > But at the same time, client1 assumes it hold the lock. +
Hulunbier 2013-01-11, 16:10
-
Re: Getting confused with the "recipe for lock"Jordan Zimmerman 2013-01-11, 20:20
If client1's hearbeat fails its main watcher will get a Disconnect event. Well behaving ZK applications must watch for this and assume that it no longer holds the lock and, thus, should delete its node. If client1 needs the lock again it should try to re-acquire it from step 1 of the recipe. Further, well behaving ZK applications must re-try node deletes if there is a connection problem. Have a look at Curator's implementation for details.
-JZ On Jan 11, 2013, at 5:46 AM, Zhao Boran <[EMAIL PROTECTED]> wrote: > While reading the zookeeper's recipe for > lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks>, > I get confused: > > Seems that this recipe-for-distributed-lock can not guarantee *"any > snapshot in time no two clients think they hold the same lock"*. > > But since zookeeper is so widely adopted, if there were such mistakes in > the reference doc, someone should have pointed it out long time ago. > > So, what did I misunderstand? please help me! > > Recipe-for-distributed-lock (from > http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks) > > Locks > > Fully distributed locks that are globally synchronous, *meaning at any > snapshot in time no two clients think they hold the same lock*. These can > be implemented using ZooKeeeper. As with priority queues, first define a > lock node. > > 1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the > sequence and ephemeral flags set. > 2. Call getChildren( ) on the lock node without setting the watch flag > (this is important to avoid the herd effect). > 3. If the pathname created in step 1 has the lowest sequence number > suffix, the client has the lock and the client exits the protocol. > 4. The client calls exists( ) with the watch flag set on the path in the > lock directory with the next lowest sequence number. > 5. if exists( ) returns false, go to step 2. Otherwise, wait for a > notification for the pathname from the previous step before going to step 2. > > Considering the following case: > > - > > Client1 successfully acquired the lock(in step3), with zk node > "locknode/guid-lock-0"; > - > > Client2 created node "locknode/guid-lock-1", failed to acquire the lock, > and watching "locknode/guid-lock-0"; > - > > Later, for some reasons(network congestion?), client1 failed to send > heart beat message to zk cluster on time, but client1 is still perfectly > working, and assuming itself still holding the lock. > - > > But, Zookeeper may think client1's session is timeouted, and then > 1. deletes "locknode/guid-lock-0" > 2. sends a notification to Client2 (or send the notification first?) > 3. but can not send "session timeout" notification to client1 in time > (due to network congestion?) > > > - > > Client2 got the notification, goes to step 2, gets the only node > ""locknode/guid-lock-1", which is created by itself; thus, client2 assumes > it hold the lock. > - > > But at the same time, client1 assumes it hold the lock. > > Is this a valid scenario? > > Thanks a lot! +
Jordan Zimmerman 2013-01-11, 20:20
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-12, 10:30
Thanks Jordan,
> If client1's hearbeat fails its main watcher will get a Disconnect event Suppose the network link betweens client1 and server is at very low quality (high packet loss rate?) but still fully functional. Client1 may be happily sending heart-beat-messages to server without notice anything; but ZK server could be unable to receive heart-beat-messages from client1 for a long period of time , which leads ZK server to timeout client1's session, and delete the ephemeral node. Thus, client's session could be timeouted by ZK server, without triggering a Disconnect event. >Well behaving ZK applications must watch for this and assume that it no longer holds the lock and, thus, should delete its node. If client1 needs the lock again it should try to re-acquire it from step 1 of the recipe. Further, well behaving ZK applications must re-try node deletes if there is a connection problem. Have a look at Curator's implementation for details. Thanks for pointing me the "Curator's implementation", I will dig into the source code. But I still feels that, no matter how well a ZK application behaves, if we use ephemeral node in the lock-recipe; we can not guarantee "at any snapshot in time no two clients think they hold the same lock", which is the fundamental requirement/constraint for a lock. Mr. Andrey Stepachev suggested that I should use a timer in client side to track session_timeout, that sounds reasonable; but I think this implicitly implies some constrains of clock drift - which I am not expected in a solution based on Zookeeper (ZK is supposed to keep the animals well). On Sat, Jan 12, 2013 at 4:20 AM, Jordan Zimmerman <[EMAIL PROTECTED]> wrote: > > If client1's hearbeat fails its main watcher will get a Disconnect event. Well behaving ZK applications must watch for this and assume that it no longer holds the lock and, thus, should delete its node. If client1 needs the lock again it should try to re-acquire it from step 1 of the recipe. Further, well behaving ZK applications must re-try node deletes if there is a connection problem. Have a look at Curator's implementation for details. > > -JZ > > On Jan 11, 2013, at 5:46 AM, Zhao Boran <[EMAIL PROTECTED]> wrote: > > > While reading the zookeeper's recipe for > > lock<http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks>, > > I get confused: > > > > Seems that this recipe-for-distributed-lock can not guarantee *"any > > snapshot in time no two clients think they hold the same lock"*. > > > > But since zookeeper is so widely adopted, if there were such mistakes in > > the reference doc, someone should have pointed it out long time ago. > > > > So, what did I misunderstand? please help me! > > > > Recipe-for-distributed-lock (from > > http://zookeeper.apache.org/doc/trunk/recipes.html#sc_recipes_Locks) > > > > Locks > > > > Fully distributed locks that are globally synchronous, *meaning at any > > snapshot in time no two clients think they hold the same lock*. These can > > be implemented using ZooKeeeper. As with priority queues, first define a > > lock node. > > > > 1. Call create( ) with a pathname of "*locknode*/guid-lock-" and the > > sequence and ephemeral flags set. > > 2. Call getChildren( ) on the lock node without setting the watch flag > > (this is important to avoid the herd effect). > > 3. If the pathname created in step 1 has the lowest sequence number > > suffix, the client has the lock and the client exits the protocol. > > 4. The client calls exists( ) with the watch flag set on the path in the > > lock directory with the next lowest sequence number. > > 5. if exists( ) returns false, go to step 2. Otherwise, wait for a > > notification for the pathname from the previous step before going to step 2. > > > > Considering the following case: > > > > - > > > > Client1 successfully acquired the lock(in step3), with zk node > > "locknode/guid-lock-0"; > > - > > > > Client2 created node "locknode/guid-lock-1", failed to acquire the lock, +
Hulunbier 2013-01-12, 10:30
-
Re: Getting confused with the "recipe for lock"Ben Bangert 2013-01-12, 17:39
On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote:
> Suppose the network link betweens client1 and server is at very low > quality (high packet loss rate?) but still fully functional. > > Client1 may be happily sending heart-beat-messages to server without > notice anything; but ZK server could be unable to receive > heart-beat-messages from client1 for a long period of time , which > leads ZK server to timeout client1's session, and delete the ephemeral > node. If the ZK server doesn't get the ping, then it won't reply to it, and the client *should be* expecting a ping reply. However, it occurs to me that I don't actually check in my Python implementation that I actually *get* a ping reply, nor does the Java client afaik. The scenario below is what will actually happen on the ZK server though, so the client will react appropriately then. > Thus, client's session could be timeouted by ZK server, without > triggering a Disconnect event. Then the ZK server will tear down the connection, the client will definitely notice the other side of the TCP connection going down and should react appropriately. >> Well behaving ZK applications must watch for this and assume that it no longer holds the lock and, thus, should delete its node. If client1 needs the lock again it should try to re-acquire it from step 1 of the recipe. Further, well behaving ZK applications must re-try node deletes if there is a connection problem. Have a look at Curator's implementation for details. > > Thanks for pointing me the "Curator's implementation", I will dig into > the source code. > > But I still feels that, no matter how well a ZK application behaves, > if we use ephemeral node in the lock-recipe; we can not guarantee "at > any snapshot in time no two clients think they hold the same lock", > which is the fundamental requirement/constraint for a lock. If both clients fetch the children, using ephemeral sequential nodes guarantees that every node has a sequence number appended, and only one will have the lowest number. At the *very least* the lock holder knows that it has the lock for the amount of time equal to the session expiration, if the connection is torn down or becomes otherwise disconnected the client will generate an event that code should listen for to react appropriately when there is no longer a guarantee that the lock is held. > Mr. Andrey Stepachev suggested that I should use a timer in client > side to track session_timeout, that sounds reasonable; but I think > this implicitly implies some constrains of clock drift - which I am > not expected in a solution based on Zookeeper (ZK is supposed to keep > the animals well). An alternative implementation that would alleviate this worry (though introduce the risk of dead-locks) would be to not use ephemeral sequential nodes, and just sequential nodes. This means that a lock would *never* be released until the client releases it, which might be more appropriate for you if this lock is governing something so important. Though you will of course need something else to alert you if its possible you're in a dead-lock scenario (client dies without releasing lock). -- /////^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\\\\\ || Ben Bangert || || [EMAIL PROTECTED] || || http://be.groovie.org/ || \\\\\_________________________________________________________///// +
Ben Bangert 2013-01-12, 17:39
-
Re: Getting confused with the "recipe for lock"Jordan Zimmerman 2013-01-13, 01:31
On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote:
> Suppose the network link betweens client1 and server is at very low > quality (high packet loss rate?) but still fully functional. > > Client1 may be happily sending heart-beat-messages to server without > notice anything; but ZK server could be unable to receive > heart-beat-messages from client1 for a long period of time , which > leads ZK server to timeout client1's session, and delete the ephemeral > node I believe the heartbeats go both ways. Thus, if the client doesn't hear from the server it will post a Disconnected event. > But I still feels that, no matter how well a ZK application behaves, > if we use ephemeral node in the lock-recipe; we can not guarantee "at > any snapshot in time no two clients think they hold the same lock", > which is the fundamental requirement/constraint for a lock. Assuming the clocks are in sync between all participants… The server and the client that holds the lock should determine that there is a disconnection at nearly the same time. I imagine that there is a certain amount of time (a few milliseconds) overlap here. But, the next client wouldn't get the notification immediately anyway. Further, when the next client gets the notification, it still needs to execute a getChildren() command, process the results, etc. before it can determine that it has the lock. That two clients would think they have the lock at the same time is a vanishingly small possibility. Even if it did happen it would only be for a few milliseconds at most. Someone with better understanding of ZK internals can correct me, but this is my understanding. -Jordan +
Jordan Zimmerman 2013-01-13, 01:31
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-13, 15:05
Thanks Jordan,
> Assuming the clocks are in sync between all participants… imho, perfect clock synchronization in a distributed system is very hard (if it can be). > Someone with better understanding of ZK internals can correct me, but this is my understanding. I think I might have missed some very important and subtile(or obvious?) points of the recipe / ZK protocol. I just can not believe that, there could be such type of a flaw in the lock-recipe, for so long time, without anybody has pointed it out. On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman <[EMAIL PROTECTED]> wrote: > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: > >> Suppose the network link betweens client1 and server is at very low >> quality (high packet loss rate?) but still fully functional. >> >> Client1 may be happily sending heart-beat-messages to server without >> notice anything; but ZK server could be unable to receive >> heart-beat-messages from client1 for a long period of time , which >> leads ZK server to timeout client1's session, and delete the ephemeral >> node > > I believe the heartbeats go both ways. Thus, if the client doesn't hear from the server it will post a Disconnected event. > >> But I still feels that, no matter how well a ZK application behaves, >> if we use ephemeral node in the lock-recipe; we can not guarantee "at >> any snapshot in time no two clients think they hold the same lock", >> which is the fundamental requirement/constraint for a lock. > > Assuming the clocks are in sync between all participants… The server and the client that holds the lock should determine that there is a disconnection at nearly the same time. I imagine that there is a certain amount of time (a few milliseconds) overlap here. But, the next client wouldn't get the notification immediately anyway. Further, when the next client gets the notification, it still needs to execute a getChildren() command, process the results, etc. before it can determine that it has the lock. That two clients would think they have the lock at the same time is a vanishingly small possibility. Even if it did happen it would only be for a few milliseconds at most. > > Someone with better understanding of ZK internals can correct me, but this is my understanding. > > -Jordan +
Hulunbier 2013-01-13, 15:05
-
Re: Getting confused with the "recipe for lock"Vitalii Tymchyshyn 2013-01-14, 10:37
I don't see why clock must be in sync. They are counting time periods
(timeouts). Also note the difference between disconnected and session expired events. This time difference is when client knows "something's wrong", but another client did not get a lock yet. You will have problems if client can't react (and release resources) between this two events. Best regards, Vitalii Tymchyshyn 2013/1/13 Hulunbier <[EMAIL PROTECTED]> > Thanks Jordan, > > > Assuming the clocks are in sync between all participants… > > imho, perfect clock synchronization in a distributed system is very > hard (if it can be). > > > Someone with better understanding of ZK internals can correct me, but > this is my understanding. > > I think I might have missed some very important and subtile(or > obvious?) points of the recipe / ZK protocol. > > I just can not believe that, there could be such type of a flaw in the > lock-recipe, for so long time, without anybody has pointed it out. > > On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman > <[EMAIL PROTECTED]> wrote: > > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: > > > >> Suppose the network link betweens client1 and server is at very low > >> quality (high packet loss rate?) but still fully functional. > >> > >> Client1 may be happily sending heart-beat-messages to server without > >> notice anything; but ZK server could be unable to receive > >> heart-beat-messages from client1 for a long period of time , which > >> leads ZK server to timeout client1's session, and delete the ephemeral > >> node > > > > I believe the heartbeats go both ways. Thus, if the client doesn't hear > from the server it will post a Disconnected event. > > > >> But I still feels that, no matter how well a ZK application behaves, > >> if we use ephemeral node in the lock-recipe; we can not guarantee "at > >> any snapshot in time no two clients think they hold the same lock", > >> which is the fundamental requirement/constraint for a lock. > > > > Assuming the clocks are in sync between all participants… The server and > the client that holds the lock should determine that there is a > disconnection at nearly the same time. I imagine that there is a certain > amount of time (a few milliseconds) overlap here. But, the next client > wouldn't get the notification immediately anyway. Further, when the next > client gets the notification, it still needs to execute a getChildren() > command, process the results, etc. before it can determine that it has the > lock. That two clients would think they have the lock at the same time is a > vanishingly small possibility. Even if it did happen it would only be for a > few milliseconds at most. > > > > Someone with better understanding of ZK internals can correct me, but > this is my understanding. > > > > -Jordan > -- Best regards, Vitalii Tymchyshyn +
Vitalii Tymchyshyn 2013-01-14, 10:37
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-14, 15:06
Hi Vitalii,
> I don't see why clock must be in sync. I don't see any reason to precisely sync the clocks either (but if we could ... that would be wonderful.). By *some constrains of clock drift*, I mean : "Every node has a clock, and all clocks increase at the same rate" or "the server’s clock advance no faster than a known constant factor faster than the client’s.". >Also note the difference between disconnected and session > expired events. This time difference is when client knows "something's > wrong", but another client did not get a lock yet. sorry, but I failed to get your idea well; would you please give me some further explanation? On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote: > I don't see why clock must be in sync. They are counting time periods > (timeouts). Also note the difference between disconnected and session > expired events. This time difference is when client knows "something's > wrong", but another client did not get a lock yet. You will have problems > if client can't react (and release resources) between this two events. > > Best regards, Vitalii Tymchyshyn > > > 2013/1/13 Hulunbier <[EMAIL PROTECTED]> > >> Thanks Jordan, >> >> > Assuming the clocks are in sync between all participants… >> >> imho, perfect clock synchronization in a distributed system is very >> hard (if it can be). >> >> > Someone with better understanding of ZK internals can correct me, but >> this is my understanding. >> >> I think I might have missed some very important and subtile(or >> obvious?) points of the recipe / ZK protocol. >> >> I just can not believe that, there could be such type of a flaw in the >> lock-recipe, for so long time, without anybody has pointed it out. >> >> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman >> <[EMAIL PROTECTED]> wrote: >> > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: >> > >> >> Suppose the network link betweens client1 and server is at very low >> >> quality (high packet loss rate?) but still fully functional. >> >> >> >> Client1 may be happily sending heart-beat-messages to server without >> >> notice anything; but ZK server could be unable to receive >> >> heart-beat-messages from client1 for a long period of time , which >> >> leads ZK server to timeout client1's session, and delete the ephemeral >> >> node >> > >> > I believe the heartbeats go both ways. Thus, if the client doesn't hear >> from the server it will post a Disconnected event. >> > >> >> But I still feels that, no matter how well a ZK application behaves, >> >> if we use ephemeral node in the lock-recipe; we can not guarantee "at >> >> any snapshot in time no two clients think they hold the same lock", >> >> which is the fundamental requirement/constraint for a lock. >> > >> > Assuming the clocks are in sync between all participants… The server and >> the client that holds the lock should determine that there is a >> disconnection at nearly the same time. I imagine that there is a certain >> amount of time (a few milliseconds) overlap here. But, the next client >> wouldn't get the notification immediately anyway. Further, when the next >> client gets the notification, it still needs to execute a getChildren() >> command, process the results, etc. before it can determine that it has the >> lock. That two clients would think they have the lock at the same time is a >> vanishingly small possibility. Even if it did happen it would only be for a >> few milliseconds at most. >> > >> > Someone with better understanding of ZK internals can correct me, but >> this is my understanding. >> > >> > -Jordan >> > > > > -- > Best regards, > Vitalii Tymchyshyn +
Hulunbier 2013-01-14, 15:06
-
Re: Getting confused with the "recipe for lock"Vitalii Tymchyshyn 2013-01-14, 15:38
There are two events: disconnected and session expired. The ephemeral nodes
are removed after the second one. The client receives both. So to implement "at most one lock holder" scheme, client owning lock must think it've lost lock ownership since it've received disconnected event. So, there is period of time between disconnect and session expired when noone should have the lock. It's "safety" time to accomodate for time shifts, network latencies, lock ownership recheck interval (in case when client can't stop using resource immediatelly and simply checks regulary if it still holds the lock). 2013/1/14 Hulunbier <[EMAIL PROTECTED]> > Hi Vitalii, > > > I don't see why clock must be in sync. > > I don't see any reason to precisely sync the clocks either (but if we > could ... that would be wonderful.). > > By *some constrains of clock drift*, I mean : > > "Every node has a clock, and all clocks increase at the same rate" > or > "the server’s clock advance no faster than a known constant factor > faster than the client’s.". > > > >Also note the difference between disconnected and session > > expired events. This time difference is when client knows "something's > > wrong", but another client did not get a lock yet. > > sorry, but I failed to get your idea well; would you please give me > some further explanation? > > > On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> > wrote: > > I don't see why clock must be in sync. They are counting time periods > > (timeouts). Also note the difference between disconnected and session > > expired events. This time difference is when client knows "something's > > wrong", but another client did not get a lock yet. You will have problems > > if client can't react (and release resources) between this two events. > > > > Best regards, Vitalii Tymchyshyn > > > > > > 2013/1/13 Hulunbier <[EMAIL PROTECTED]> > > > >> Thanks Jordan, > >> > >> > Assuming the clocks are in sync between all participants… > >> > >> imho, perfect clock synchronization in a distributed system is very > >> hard (if it can be). > >> > >> > Someone with better understanding of ZK internals can correct me, but > >> this is my understanding. > >> > >> I think I might have missed some very important and subtile(or > >> obvious?) points of the recipe / ZK protocol. > >> > >> I just can not believe that, there could be such type of a flaw in the > >> lock-recipe, for so long time, without anybody has pointed it out. > >> > >> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman > >> <[EMAIL PROTECTED]> wrote: > >> > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: > >> > > >> >> Suppose the network link betweens client1 and server is at very low > >> >> quality (high packet loss rate?) but still fully functional. > >> >> > >> >> Client1 may be happily sending heart-beat-messages to server without > >> >> notice anything; but ZK server could be unable to receive > >> >> heart-beat-messages from client1 for a long period of time , which > >> >> leads ZK server to timeout client1's session, and delete the > ephemeral > >> >> node > >> > > >> > I believe the heartbeats go both ways. Thus, if the client doesn't > hear > >> from the server it will post a Disconnected event. > >> > > >> >> But I still feels that, no matter how well a ZK application behaves, > >> >> if we use ephemeral node in the lock-recipe; we can not guarantee "at > >> >> any snapshot in time no two clients think they hold the same lock", > >> >> which is the fundamental requirement/constraint for a lock. > >> > > >> > Assuming the clocks are in sync between all participants… The server > and > >> the client that holds the lock should determine that there is a > >> disconnection at nearly the same time. I imagine that there is a certain > >> amount of time (a few milliseconds) overlap here. But, the next client > >> wouldn't get the notification immediately anyway. Further, when the next > >> client gets the notification, it still needs to execute a getChildren() Best regards, Vitalii Tymchyshyn +
Vitalii Tymchyshyn 2013-01-14, 15:38
-
Re: Getting confused with the "recipe for lock"Ted Dunning 2013-01-14, 16:05
Yes.
And in general, you can't have precise distributed lock control. There will always be a bit of slop. So decide which penalty is easier to pay. Do you want "at-most-one" or "at-least-one" or something in between? You can't have "exactly-one" and still deal with expected problems like partition or node failure. On Mon, Jan 14, 2013 at 7:38 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]>wrote: > There are two events: disconnected and session expired. The ephemeral nodes > are removed after the second one. The client receives both. So to > implement "at most one lock holder" scheme, client owning lock must think > it've lost lock ownership since it've received disconnected event. So, > there is period of time between disconnect and session expired when noone > should have the lock. It's "safety" time to accomodate for time shifts, > network latencies, lock ownership recheck interval (in case when client > can't stop using resource immediatelly and simply checks regulary if it > still holds the lock). > > > > 2013/1/14 Hulunbier <[EMAIL PROTECTED]> > > > Hi Vitalii, > > > > > I don't see why clock must be in sync. > > > > I don't see any reason to precisely sync the clocks either (but if we > > could ... that would be wonderful.). > > > > By *some constrains of clock drift*, I mean : > > > > "Every node has a clock, and all clocks increase at the same rate" > > or > > "the server’s clock advance no faster than a known constant factor > > faster than the client’s.". > > > > > > >Also note the difference between disconnected and session > > > expired events. This time difference is when client knows "something's > > > wrong", but another client did not get a lock yet. > > > > sorry, but I failed to get your idea well; would you please give me > > some further explanation? > > > > > > On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> > > wrote: > > > I don't see why clock must be in sync. They are counting time periods > > > (timeouts). Also note the difference between disconnected and session > > > expired events. This time difference is when client knows "something's > > > wrong", but another client did not get a lock yet. You will have > problems > > > if client can't react (and release resources) between this two events. > > > > > > Best regards, Vitalii Tymchyshyn > > > > > > > > > 2013/1/13 Hulunbier <[EMAIL PROTECTED]> > > > > > >> Thanks Jordan, > > >> > > >> > Assuming the clocks are in sync between all participants… > > >> > > >> imho, perfect clock synchronization in a distributed system is very > > >> hard (if it can be). > > >> > > >> > Someone with better understanding of ZK internals can correct me, > but > > >> this is my understanding. > > >> > > >> I think I might have missed some very important and subtile(or > > >> obvious?) points of the recipe / ZK protocol. > > >> > > >> I just can not believe that, there could be such type of a flaw in the > > >> lock-recipe, for so long time, without anybody has pointed it out. > > >> > > >> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman > > >> <[EMAIL PROTECTED]> wrote: > > >> > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: > > >> > > > >> >> Suppose the network link betweens client1 and server is at very low > > >> >> quality (high packet loss rate?) but still fully functional. > > >> >> > > >> >> Client1 may be happily sending heart-beat-messages to server > without > > >> >> notice anything; but ZK server could be unable to receive > > >> >> heart-beat-messages from client1 for a long period of time , which > > >> >> leads ZK server to timeout client1's session, and delete the > > ephemeral > > >> >> node > > >> > > > >> > I believe the heartbeats go both ways. Thus, if the client doesn't > > hear > > >> from the server it will post a Disconnected event. > > >> > > > >> >> But I still feels that, no matter how well a ZK application > behaves, > > >> >> if we use ephemeral node in the lock-recipe; we can not guarantee +
Ted Dunning 2013-01-14, 16:05
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-15, 02:28
Thanks Ted,
> And in general, you can't have precise distributed lock control. There > will always be a bit of slop. Yes, I agree with you. > So decide which penalty is easier to pay. Do you want "at-most-one" or > "at-least-one" or something in between? You can't have "exactly-one" and > still deal with expected problems like partition or node failure. Yes again, I feel the same way. IMHO, a lock(basic lock, not R/W lock) should be exclusive by nature. *If* really there was such flaw in the recipe, imho, they should not claim "at any snapshot in time no two clients think they hold the same lock" , at least with some notes; it is ... misleading. On Tue, Jan 15, 2013 at 12:05 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Yes. > > And in general, you can't have precise distributed lock control. There > will always be a bit of slop. > > So decide which penalty is easier to pay. Do you want "at-most-one" or > "at-least-one" or something in between? You can't have "exactly-one" and > still deal with expected problems like partition or node failure. > > > On Mon, Jan 14, 2013 at 7:38 AM, Vitalii Tymchyshyn <[EMAIL PROTECTED]>wrote: > >> There are two events: disconnected and session expired. The ephemeral nodes >> are removed after the second one. The client receives both. So to >> implement "at most one lock holder" scheme, client owning lock must think >> it've lost lock ownership since it've received disconnected event. So, >> there is period of time between disconnect and session expired when noone >> should have the lock. It's "safety" time to accomodate for time shifts, >> network latencies, lock ownership recheck interval (in case when client >> can't stop using resource immediatelly and simply checks regulary if it >> still holds the lock). >> >> >> >> 2013/1/14 Hulunbier <[EMAIL PROTECTED]> >> >> > Hi Vitalii, >> > >> > > I don't see why clock must be in sync. >> > >> > I don't see any reason to precisely sync the clocks either (but if we >> > could ... that would be wonderful.). >> > >> > By *some constrains of clock drift*, I mean : >> > >> > "Every node has a clock, and all clocks increase at the same rate" >> > or >> > "the server’s clock advance no faster than a known constant factor >> > faster than the client’s.". >> > >> > >> > >Also note the difference between disconnected and session >> > > expired events. This time difference is when client knows "something's >> > > wrong", but another client did not get a lock yet. >> > >> > sorry, but I failed to get your idea well; would you please give me >> > some further explanation? >> > >> > >> > On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> >> > wrote: >> > > I don't see why clock must be in sync. They are counting time periods >> > > (timeouts). Also note the difference between disconnected and session >> > > expired events. This time difference is when client knows "something's >> > > wrong", but another client did not get a lock yet. You will have >> problems >> > > if client can't react (and release resources) between this two events. >> > > >> > > Best regards, Vitalii Tymchyshyn >> > > >> > > >> > > 2013/1/13 Hulunbier <[EMAIL PROTECTED]> >> > > >> > >> Thanks Jordan, >> > >> >> > >> > Assuming the clocks are in sync between all participants… >> > >> >> > >> imho, perfect clock synchronization in a distributed system is very >> > >> hard (if it can be). >> > >> >> > >> > Someone with better understanding of ZK internals can correct me, >> but >> > >> this is my understanding. >> > >> >> > >> I think I might have missed some very important and subtile(or >> > >> obvious?) points of the recipe / ZK protocol. >> > >> >> > >> I just can not believe that, there could be such type of a flaw in the >> > >> lock-recipe, for so long time, without anybody has pointed it out. >> > >> >> > >> On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman >> > >> <[EMAIL PROTECTED]> wrote: >> > >> > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: +
Hulunbier 2013-01-15, 02:28
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-15, 01:52
Hi Vitalii,
Thanks a lot, got your idea. Suppose we are measuring the time of events outsides the system(zk & clients) . And we have no client side time tracking routine. And t_i < t_k if i < k t_0 : client1 has created lock/node1, client2 has created lock/node2; client1 thinks itself holding the lock; client2 does not, and watching lock/node1. t_1 : ZK thinks client1's session is timeout(let's say, client1 is actually failed to send heart-beat message on time, due to a long pause of jvm gc). ZK deletes lock/node1, sends timeout message to client1, sends "node_not_exist" message to client2 (or send this message before the deletion, but it does not matter in our case) but for some reason, link between zk and client1 becomes very unstable, high packet loss, large amount of packet retransmission, which leads to a significant packet transmission delay(between client1 and zk only), but the tcp connection is NOT broken. t_2: client2 got the "node_not_exist" event, and issues the getChildren Cmd t_3: client2 found the only node lock/node2, and thinks itself holding the lock, and begins acting like a lock owner. (at the same time, client1 is also thinking itself holding the lock) t_4: session_timeout message not reach client1 yet, client1's jvm gc completed, doing something as the lock-owner. t_5: network becomes stable, finally, the session_timeout message sent from zk reached client1; client1 thinks itself no longer holding the lock, but it is too late, it has done something really bad between t_4 and t_5. -------------------------- Sorry for the grammar, I am not a native English speaker. On Mon, Jan 14, 2013 at 11:38 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote: > There are two events: disconnected and session expired. The ephemeral nodes > are removed after the second one. The client receives both. So to > implement "at most one lock holder" scheme, client owning lock must think > it've lost lock ownership since it've received disconnected event. So, > there is period of time between disconnect and session expired when noone > should have the lock. It's "safety" time to accomodate for time shifts, > network latencies, lock ownership recheck interval (in case when client > can't stop using resource immediatelly and simply checks regulary if it > still holds the lock). > > > > 2013/1/14 Hulunbier <[EMAIL PROTECTED]> > >> Hi Vitalii, >> >> > I don't see why clock must be in sync. >> >> I don't see any reason to precisely sync the clocks either (but if we >> could ... that would be wonderful.). >> >> By *some constrains of clock drift*, I mean : >> >> "Every node has a clock, and all clocks increase at the same rate" >> or >> "the server’s clock advance no faster than a known constant factor >> faster than the client’s.". >> >> >> >Also note the difference between disconnected and session >> > expired events. This time difference is when client knows "something's >> > wrong", but another client did not get a lock yet. >> >> sorry, but I failed to get your idea well; would you please give me >> some further explanation? >> >> >> On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> >> wrote: >> > I don't see why clock must be in sync. They are counting time periods >> > (timeouts). Also note the difference between disconnected and session >> > expired events. This time difference is when client knows "something's >> > wrong", but another client did not get a lock yet. You will have problems >> > if client can't react (and release resources) between this two events. >> > >> > Best regards, Vitalii Tymchyshyn >> > >> > >> > 2013/1/13 Hulunbier <[EMAIL PROTECTED]> >> > >> >> Thanks Jordan, >> >> >> >> > Assuming the clocks are in sync between all participants… >> >> >> >> imho, perfect clock synchronization in a distributed system is very >> >> hard (if it can be). >> >> >> >> > Someone with better understanding of ZK internals can correct me, but >> >> this is my understanding. >> >> >> >> I think I might have missed some very important and subtile(or +
Hulunbier 2013-01-15, 01:52
-
Re: Getting confused with the "recipe for lock"Jordan Zimmerman 2013-01-15, 02:23
Why would client 1s connection be unstable but client 2s not? In any normal usage the ZK clients are going to be on the same network. Or, are you thinking cross-data-center usage? In my opinion, ZooKeeper is not suited to cross data center usage.
In any event, as others have pointed out, Zookeeper is _not_ a transactional system. It is an eventually consistent system that will give you a reasonable degree of distributed coordination semantics. There are edge cases as you describe but they are in the level of noise. -Jordan On Jan 14, 2013, at 5:52 PM, Hulunbier <[EMAIL PROTECTED]> wrote: > Hi Vitalii, > > Thanks a lot, got your idea. > > Suppose we are measuring the time of events outsides the system(zk & clients) . > > And we have no client side time tracking routine. > > And t_i < t_k if i < k > > t_0 : > > client1 has created lock/node1, client2 has created lock/node2; > client1 thinks itself holding the lock; client2 does not, and watching > lock/node1. > > t_1 : > > ZK thinks client1's session is timeout(let's say, client1 is actually > failed to send heart-beat message on time, due to a long pause of jvm > gc). > > ZK deletes lock/node1, > sends timeout message to client1, > sends "node_not_exist" message to client2 (or send this message before > the deletion, but it does not matter in our case) > > but for some reason, link between zk and client1 becomes very unstable, > high packet loss, large amount of packet retransmission, > which leads to a significant packet transmission delay(between client1 > and zk only), but the tcp connection is NOT broken. > > t_2: > > client2 got the "node_not_exist" event, and issues the getChildren Cmd > > t_3: > > client2 found the only node lock/node2, and thinks itself holding the > lock, and begins acting like a lock owner. > > (at the same time, client1 is also thinking itself holding the lock) > > t_4: > > session_timeout message not reach client1 yet, > > client1's jvm gc completed, doing something as the lock-owner. > > t_5: > > network becomes stable, finally, the session_timeout message sent from > zk reached client1; > > client1 thinks itself no longer holding the lock, but it is too late, > it has done something really bad between t_4 and t_5. > > -------------------------- > > Sorry for the grammar, I am not a native English speaker. > > > On Mon, Jan 14, 2013 at 11:38 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> wrote: >> There are two events: disconnected and session expired. The ephemeral nodes >> are removed after the second one. The client receives both. So to >> implement "at most one lock holder" scheme, client owning lock must think >> it've lost lock ownership since it've received disconnected event. So, >> there is period of time between disconnect and session expired when noone >> should have the lock. It's "safety" time to accomodate for time shifts, >> network latencies, lock ownership recheck interval (in case when client >> can't stop using resource immediatelly and simply checks regulary if it >> still holds the lock). >> >> >> >> 2013/1/14 Hulunbier <[EMAIL PROTECTED]> >> >>> Hi Vitalii, >>> >>>> I don't see why clock must be in sync. >>> >>> I don't see any reason to precisely sync the clocks either (but if we >>> could ... that would be wonderful.). >>> >>> By *some constrains of clock drift*, I mean : >>> >>> "Every node has a clock, and all clocks increase at the same rate" >>> or >>> "the server’s clock advance no faster than a known constant factor >>> faster than the client’s.". >>> >>> >>>> Also note the difference between disconnected and session >>>> expired events. This time difference is when client knows "something's >>>> wrong", but another client did not get a lock yet. >>> >>> sorry, but I failed to get your idea well; would you please give me >>> some further explanation? >>> >>> >>> On Mon, Jan 14, 2013 at 6:37 PM, Vitalii Tymchyshyn <[EMAIL PROTECTED]> >>> wrote: >>>> I don't see why clock must be in sync. They are counting time periods +
Jordan Zimmerman 2013-01-15, 02:23
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-15, 03:45
Hi Jordan,
> Why would client 1s connection be unstable but client 2s not? In any normal usage the ZK clients are going to be on the same network. Or, are you thinking cross-data-center usage? In my opinion, ZooKeeper is not suited to cross data center usage. er... the word "unstable" I used is misleading; A full functional(or stable?) tcp connection is supposed to be encountered with some network congestion, and should / can handle this situation well, but might be with some delay of delivering the segments; High volume of traffic in LAN may lead to the above situation, and it is not rare, I think. Even if there was no such congestion, there is always a time lag, between zk sends session-timeout message and client receives the message; Without any assumption, we can not ensure that , the client could be ware of that it no longer has the lock - before other clients got the node_not_exist notification and successful executed getChildren and thought it(one of the others) having the lock. I think in practice, we could (or have to) accept this assumption : "the server’s clock advance no faster than a known constant factor faster than the client’s". But the assumption itself is not enough for the correctness of lock protocol; because the client can only passively waiting for the session_time_out message, so the client may need a timer to explicitly check time elapsed. But the recipe claims clearly that: "at any snapshot in time no two clients think they hold the same lock", and "There is no polling or timeouts." > In any event, as others have pointed out, Zookeeper is _not_ a transactional system. > It is an eventually consistent system that will give you a reasonable degree of distributed coordination semantics. I should admit that I do not know whether ZK is eventually consistent , transactional or not. (BTW, there is a recipe for 2pc, and some guys claim that *Zab* is Sequential Consistent); Does these properties of ZK implies there is assumptions of clock drift? >There are edge cases as you describe but they are in the level of noise. You might be right, but for me, edge cases is what I am worrying about (please do not get me wrong, I mean, different applications have different requirements / constraints). > > -Jordan > > On Jan 14, 2013, at 5:52 PM, Hulunbier <[EMAIL PROTECTED]> wrote: > >> Hi Vitalii, >> >> Thanks a lot, got your idea. >> >> Suppose we are measuring the time of events outsides the system(zk & clients) . >> >> And we have no client side time tracking routine. >> >> And t_i < t_k if i < k >> >> t_0 : >> >> client1 has created lock/node1, client2 has created lock/node2; >> client1 thinks itself holding the lock; client2 does not, and watching >> lock/node1. >> >> t_1 : >> >> ZK thinks client1's session is timeout(let's say, client1 is actually >> failed to send heart-beat message on time, due to a long pause of jvm >> gc). >> >> ZK deletes lock/node1, >> sends timeout message to client1, >> sends "node_not_exist" message to client2 (or send this message before >> the deletion, but it does not matter in our case) >> >> but for some reason, link between zk and client1 becomes very unstable, >> high packet loss, large amount of packet retransmission, >> which leads to a significant packet transmission delay(between client1 >> and zk only), but the tcp connection is NOT broken. >> >> t_2: >> >> client2 got the "node_not_exist" event, and issues the getChildren Cmd >> >> t_3: >> >> client2 found the only node lock/node2, and thinks itself holding the >> lock, and begins acting like a lock owner. >> >> (at the same time, client1 is also thinking itself holding the lock) >> >> t_4: >> >> session_timeout message not reach client1 yet, >> >> client1's jvm gc completed, doing something as the lock-owner. >> >> t_5: >> >> network becomes stable, finally, the session_timeout message sent from >> zk reached client1; >> >> client1 thinks itself no longer holding the lock, but it is too late, >> it has done something really bad between t_4 and t_5. +
Hulunbier 2013-01-15, 03:45
-
Re: Getting confused with the "recipe for lock"Benjamin Reed 2013-01-15, 05:27
sorry to jump in the middle, but i thought i'd point out a couple of things.
at the heart of ZK is Zab, which is an atomic broadcast protocol (it actually has stronger guarantees than just atomic broadcast: it also guarantees primary order). updates go through this protocol which gives us sequential consistency for writes. failure detection uses timeouts, as most failure detectors do, so we have some assumptions on bounds of message delays and drifts of clocks. in the end, these assumptions are manifest in the sync and initial timeouts of the server and the session timeouts of the clients. as long as the assumptions are true, things will stay consistent, if the assumptions fail, such as when HBase region servers went into gc for many minutes and then woke up still thinking they are the leader, bad things can happen. the fix may be to use more conservative assumptions or to use a fencing scheme with external resources. if the assumptions are violated by the zookeeper cluster, it will manifest as a liveness problem rather than a safety issue. (in theory at least, we do have bugs occasionally :) ben On Mon, Jan 14, 2013 at 7:45 PM, Hulunbier <[EMAIL PROTECTED]> wrote: > Hi Jordan, > >> Why would client 1s connection be unstable but client 2s not? In any normal usage the ZK clients are going to be on the same network. Or, are you thinking cross-data-center usage? In my opinion, ZooKeeper is not suited to cross data center usage. > > er... the word "unstable" I used is misleading; A full functional(or > stable?) tcp connection is supposed to be encountered with some > network congestion, and should / can handle this situation well, but > might be with some delay of delivering the segments; High volume of > traffic in LAN may lead to the above situation, and it is not rare, I > think. > > Even if there was no such congestion, there is always a time lag, > between zk sends session-timeout message and client receives the > message; > Without any assumption, we can not ensure that , the client could be > ware of that it no longer has the lock - before other clients got the > node_not_exist notification and successful executed getChildren and > thought it(one of the others) having the lock. > > I think in practice, we could (or have to) accept this assumption : > "the server’s clock advance no faster than a known constant factor > faster than the client’s". > > But the assumption itself is not enough for the correctness of lock > protocol; because the client can only passively waiting for the > session_time_out message, so the client may need a timer to explicitly > check time elapsed. > > But the recipe claims clearly that: "at any snapshot in time no two > clients think they hold the same lock", and "There is no polling or > timeouts." > > >> In any event, as others have pointed out, Zookeeper is _not_ a transactional system. > >> It is an eventually consistent system that will give you a reasonable degree of distributed coordination semantics. > > I should admit that I do not know whether ZK is eventually consistent > , transactional or not. (BTW, there is a recipe for 2pc, and some guys > claim that *Zab* is Sequential Consistent); > > Does these properties of ZK implies there is assumptions of clock drift? > >>There are edge cases as you describe but they are in the level of noise. > > You might be right, but for me, edge cases is what I am worrying about > (please do not get me wrong, I mean, different applications have > different requirements / constraints). > >> >> -Jordan >> >> On Jan 14, 2013, at 5:52 PM, Hulunbier <[EMAIL PROTECTED]> wrote: >> >>> Hi Vitalii, >>> >>> Thanks a lot, got your idea. >>> >>> Suppose we are measuring the time of events outsides the system(zk & clients) . >>> >>> And we have no client side time tracking routine. >>> >>> And t_i < t_k if i < k >>> >>> t_0 : >>> >>> client1 has created lock/node1, client2 has created lock/node2; >>> client1 thinks itself holding the lock; client2 does not, and watching +
Benjamin Reed 2013-01-15, 05:27
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-15, 06:32
Benjamin,
thanks a lot, for your response and the great product you guys designed and implemented. > these assumptions are manifest in ... the session timeouts of the clients. Does this mean that session_expired event may be triggered all by zk-client-library itself ?(by something like a built-in client-local timer, without notification from zk server? ) (I am digging into the source code, but in case of misunderstanding of the code, I need your confirmation please) > HBase region servers went into gc for many minutes and then woke up still thinking they are the leader Could this happen if I just follow(correctly and without a client-local timer or external fencing resources) the recipe for distributed clock? On Tue, Jan 15, 2013 at 1:27 PM, Benjamin Reed <[EMAIL PROTECTED]> wrote: > sorry to jump in the middle, but i thought i'd point out a couple of things. > > at the heart of ZK is Zab, which is an atomic broadcast protocol (it > actually has stronger guarantees than just atomic broadcast: it also > guarantees primary order). updates go through this protocol which > gives us sequential consistency for writes. > > failure detection uses timeouts, as most failure detectors do, so we > have some assumptions on bounds of message delays and drifts of > clocks. in the end, these assumptions are manifest in the sync and > initial timeouts of the server and the session timeouts of the > clients. > > as long as the assumptions are true, things will stay consistent, if > the assumptions fail, such as when HBase region servers went into gc > for many minutes and then woke up still thinking they are the leader, > bad things can happen. the fix may be to use more conservative > assumptions or to use a fencing scheme with external resources. > > if the assumptions are violated by the zookeeper cluster, it will > manifest as a liveness problem rather than a safety issue. (in theory > at least, we do have bugs occasionally :) > > ben > > On Mon, Jan 14, 2013 at 7:45 PM, Hulunbier <[EMAIL PROTECTED]> wrote: >> Hi Jordan, >> >>> Why would client 1s connection be unstable but client 2s not? In any normal usage the ZK clients are going to be on the same network. Or, are you thinking cross-data-center usage? In my opinion, ZooKeeper is not suited to cross data center usage. >> >> er... the word "unstable" I used is misleading; A full functional(or >> stable?) tcp connection is supposed to be encountered with some >> network congestion, and should / can handle this situation well, but >> might be with some delay of delivering the segments; High volume of >> traffic in LAN may lead to the above situation, and it is not rare, I >> think. >> >> Even if there was no such congestion, there is always a time lag, >> between zk sends session-timeout message and client receives the >> message; >> Without any assumption, we can not ensure that , the client could be >> ware of that it no longer has the lock - before other clients got the >> node_not_exist notification and successful executed getChildren and >> thought it(one of the others) having the lock. >> >> I think in practice, we could (or have to) accept this assumption : >> "the server’s clock advance no faster than a known constant factor >> faster than the client’s". >> >> But the assumption itself is not enough for the correctness of lock >> protocol; because the client can only passively waiting for the >> session_time_out message, so the client may need a timer to explicitly >> check time elapsed. >> >> But the recipe claims clearly that: "at any snapshot in time no two >> clients think they hold the same lock", and "There is no polling or >> timeouts." >> >> >>> In any event, as others have pointed out, Zookeeper is _not_ a transactional system. >> >>> It is an eventually consistent system that will give you a reasonable degree of distributed coordination semantics. >> >> I should admit that I do not know whether ZK is eventually consistent >> , transactional or not. (BTW, there is a recipe for 2pc, and some guys +
Hulunbier 2013-01-15, 06:32
-
Re: Getting confused with the "recipe for lock"Ted Dunning 2013-01-17, 11:43
On Mon, Jan 14, 2013 at 10:32 PM, Hulunbier <[EMAIL PROTECTED]> wrote:
> > HBase region servers went into gc for many minutes and then woke up > still thinking they are the leader > > Could this happen if I just follow(correctly and without a > client-local timer or external fencing resources) the recipe for > distributed clock? How is a distributed clock going to work if you can't talk to zookeeper? +
Ted Dunning 2013-01-17, 11:43
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-18, 08:26
Hi Ted,
>> Could this happen if I just follow(correctly and without a >> client-local timer or external fencing resources) the recipe for >> distributed CLOCK? > > > How is a distributed clock going to work if you can't talk to zookeeper? Sorry, it is a typo , I mean "... the recipe for distributed LOCK". On Thu, Jan 17, 2013 at 7:43 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Mon, Jan 14, 2013 at 10:32 PM, Hulunbier <[EMAIL PROTECTED]> wrote: > >> > HBase region servers went into gc for many minutes and then woke up >> still thinking they are the leader >> >> Could this happen if I just follow(correctly and without a >> client-local timer or external fencing resources) the recipe for >> distributed clock? > > > How is a distributed clock going to work if you can't talk to zookeeper? +
Hulunbier 2013-01-18, 08:26
-
Re: Getting confused with the "recipe for lock"Benjamin Reed 2013-01-17, 04:28
> Does this mean that session_expired event may be triggered all by
> zk-client-library itself ?(by something like a built-in client-local > timer, without notification from zk server? ) > > (I am digging into the source code, but in case of misunderstanding of > the code, I need your confirmation please) our general rule with zookeeper is to either give the correct answer or say "i don't know". connection loss is the zookeeper version of i don't know. you will only get the session expired event when the client gets confirmation from a server that the session is really gone. this means that even if a client is disconnected for days, you will not get the session expired until client connects to the server and the server tells it that the connection has expired. > >> HBase region servers went into gc for many minutes and then woke up still thinking they are the leader > > Could this happen if I just follow(correctly and without a > client-local timer or external fencing resources) the recipe for > distributed clock? this did happen with hbase. client-local timers don't help. there are multiple problems going on: if a process is the leader and the gc freezes time (or the process gets swapped out or the hypervisor suspends the vm) right before the instruction that can only be executed by a leader (send database update for example), when time unfreezes, the rest of the system knows the leader has changed, another thread in the process might be figuring it out, local timers might be ready to fire, but the "leader instruction" will execute since that is the next instruction for the CPU to execute. ben +
Benjamin Reed 2013-01-17, 04:28
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-17, 09:05
Benjamin,
Greatly appreciate your thorough explanation, thanks a lot! hunlunbier On Thu, Jan 17, 2013 at 12:28 PM, Benjamin Reed <[EMAIL PROTECTED]> wrote: >> Does this mean that session_expired event may be triggered all by >> zk-client-library itself ?(by something like a built-in client-local >> timer, without notification from zk server? ) >> >> (I am digging into the source code, but in case of misunderstanding of >> the code, I need your confirmation please) > > our general rule with zookeeper is to either give the correct answer > or say "i don't know". connection loss is the zookeeper version of i > don't know. you will only get the session expired event when the > client gets confirmation from a server that the session is really > gone. this means that even if a client is disconnected for days, you > will not get the session expired until client connects to the server > and the server tells it that the connection has expired. > >> >>> HBase region servers went into gc for many minutes and then woke up still thinking they are the leader >> >> Could this happen if I just follow(correctly and without a >> client-local timer or external fencing resources) the recipe for >> distributed clock? > > this did happen with hbase. client-local timers don't help. there are > multiple problems going on: if a process is the leader and the gc > freezes time (or the process gets swapped out or the hypervisor > suspends the vm) right before the instruction that can only be > executed by a leader (send database update for example), when time > unfreezes, the rest of the system knows the leader has changed, > another thread in the process might be figuring it out, local timers > might be ready to fire, but the "leader instruction" will execute > since that is the next instruction for the CPU to execute. > > ben +
Hulunbier 2013-01-17, 09:05
-
Re: Getting confused with the "recipe for lock"Vitalii Tymchyshyn 2013-01-27, 19:29
BTW: We've had to introduce external cron script that were killing
processes spending too much in GC. That helped in similar situation. 2013/1/16 Benjamin Reed <[EMAIL PROTECTED]> > > Does this mean that session_expired event may be triggered all by > > zk-client-library itself ?(by something like a built-in client-local > > timer, without notification from zk server? ) > > > > (I am digging into the source code, but in case of misunderstanding of > > the code, I need your confirmation please) > > our general rule with zookeeper is to either give the correct answer > or say "i don't know". connection loss is the zookeeper version of i > don't know. you will only get the session expired event when the > client gets confirmation from a server that the session is really > gone. this means that even if a client is disconnected for days, you > will not get the session expired until client connects to the server > and the server tells it that the connection has expired. > > > > >> HBase region servers went into gc for many minutes and then woke up > still thinking they are the leader > > > > Could this happen if I just follow(correctly and without a > > client-local timer or external fencing resources) the recipe for > > distributed clock? > > this did happen with hbase. client-local timers don't help. there are > multiple problems going on: if a process is the leader and the gc > freezes time (or the process gets swapped out or the hypervisor > suspends the vm) right before the instruction that can only be > executed by a leader (send database update for example), when time > unfreezes, the rest of the system knows the leader has changed, > another thread in the process might be figuring it out, local timers > might be ready to fire, but the "leader instruction" will execute > since that is the next instruction for the CPU to execute. > > ben > -- Best regards, Vitalii Tymchyshyn +
Vitalii Tymchyshyn 2013-01-27, 19:29
-
Re: Getting confused with the "recipe for lock"Hulunbier 2013-01-13, 14:40
Thanks Ben,
> Then the ZK server will tear down the connection, the client will definitely notice the other side of the TCP connection going down and should react appropriately. It takes time for the tcp segments (FIN) to be sent from server to client. Even no bad things such as high packet loss rate happen, the recipe itself can not ensure that, during the above time gap, client2 will never think itself successfully acquired the lock. >An alternative implementation that would alleviate this worry (though introduce the risk of dead-locks) would be to not use ephemeral sequential nodes, and just sequential nodes. This means that a lock would *never* be released until the client releases it, which might be more appropriate for you if this lock is governing something so important. Though you will of course need something else to alert you if its possible you're in a dead-lock scenario (client dies without releasing lock). I do agree with you on this point. On Sun, Jan 13, 2013 at 9:31 AM, Jordan Zimmerman <[EMAIL PROTECTED]> wrote: > On Jan 12, 2013, at 2:30 AM, Hulunbier <[EMAIL PROTECTED]> wrote: > >> Suppose the network link betweens client1 and server is at very low >> quality (high packet loss rate?) but still fully functional. >> >> Client1 may be happily sending heart-beat-messages to server without >> notice anything; but ZK server could be unable to receive >> heart-beat-messages from client1 for a long period of time , which >> leads ZK server to timeout client1's session, and delete the ephemeral >> node > > I believe the heartbeats go both ways. Thus, if the client doesn't hear from the server it will post a Disconnected event. > >> But I still feels that, no matter how well a ZK application behaves, >> if we use ephemeral node in the lock-recipe; we can not guarantee "at >> any snapshot in time no two clients think they hold the same lock", >> which is the fundamental requirement/constraint for a lock. > > Assuming the clocks are in sync between all participants… The server and the client that holds the lock should determine that there is a disconnection at nearly the same time. I imagine that there is a certain amount of time (a few milliseconds) overlap here. But, the next client wouldn't get the notification immediately anyway. Further, when the next client gets the notification, it still needs to execute a getChildren() command, process the results, etc. before it can determine that it has the lock. That two clients would think they have the lock at the same time is a vanishingly small possibility. Even if it did happen it would only be for a few milliseconds at most. > > Someone with better understanding of ZK internals can correct me, but this is my understanding. > > -Jordan +
Hulunbier 2013-01-13, 14:40
|