On Thu, Jun 7, 2012 at 11:38 AM, Thawan Kooburat <[EMAIL PROTECTED]> wrote:
> We have a ZooKeeper ensemble that spans multiple data centers (each participant is in a different data center). Recently, we ran into an issue when trying to support a low session timeout (5 seconds). We set tickTime to 2 seconds and syncLimit to 25.
syncLimit of 25 with tickTime of 2000 means that you are allowing the
followers to run up to 50 seconds behind the leader.
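To make the arithmetic concrete, here is a small sketch. SyncWindow is a made-up helper for illustration, not part of ZooKeeper; the values are the ones from your post.

```java
// Illustrative arithmetic only; SyncWindow is a hypothetical helper, not a ZooKeeper class.
public class SyncWindow {
    // syncLimit is expressed in ticks; tickTime is in milliseconds.
    public static long maxFollowerLagMillis(int syncLimit, int tickTimeMillis) {
        return (long) syncLimit * tickTimeMillis;
    }

    public static void main(String[] args) {
        int tickTime = 2000;          // tickTime from the original post, in ms
        int syncLimit = 25;           // syncLimit from the original post, in ticks
        long sessionTimeout = 5000;   // requested session timeout, in ms

        long lag = maxFollowerLagMillis(syncLimit, tickTime);
        System.out.println("max follower lag (ms): " + lag);
        System.out.println("session timeout below lag window: " + (sessionTimeout < lag));
    }
}
```

With these settings the lag window (50000 ms) is ten times the session timeout (5000 ms).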
> The use case is a single master: we can only have one master at any given time. The active master creates an ephemeral node. The backup master watches for this ephemeral node to be deleted before it takes over the master role.
> The active master is connected to the follower (F1) in its data center. We believe that a network delay between F1 and the leader caused the touchTable to not propagate in a timely manner. The leader decided to close the session due to timeout. The ephemeral node delete event reached the other follower (F2) before the close session event reached F1. The backup master, which is connected to F2, got the ephemeral delete and assumed the role of the active master.
> From our logs, the active master saw the session expire event 14 seconds after the backup master received the ephemeral node delete event.
This is a consequence of setting the session timeout (5 seconds) lower
than the lag window allowed by syncLimit (syncLimit * tickTime = 50 seconds).
a) the leader will expire the session after 5 seconds of not hearing
from the client
b) the follower F1 can run up to 50 seconds behind the leader, i.e. no
communication between the follower and leader, including client heartbeats
c) let's say that F2 has perfect communication
In that case the leader might decide that the session is expired and
notify the followers. F2 gets the result quickly, F1 does not.
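A toy model of that race, as a sketch only. ExpiryRace and observedAt are hypothetical names, not ZooKeeper code, and the 14-second lag for F1 is an assumption chosen to match the gap you saw in your logs.

```java
// Toy model of the race described above; not ZooKeeper code.
public class ExpiryRace {
    // Time (ms) at which a client attached to a follower observes an event
    // that the leader committed at leaderCommitMillis, given that follower's lag.
    public static long observedAt(long leaderCommitMillis, long followerLagMillis) {
        return leaderCommitMillis + followerLagMillis;
    }

    public static void main(String[] args) {
        long expiredAtLeader = 5_000; // leader expires the session after 5 s of silence
        long lagF2 = 0;               // assumption: F2 has perfect communication
        long lagF1 = 14_000;          // assumption: chosen to match the 14 s gap in the logs

        long backupSeesDelete = observedAt(expiredAtLeader, lagF2);
        long activeSeesExpiry = observedAt(expiredAtLeader, lagF1);

        System.out.println("backup master sees delete at (ms): " + backupSeesDelete);
        System.out.println("active master sees expiry at (ms): " + activeSeesExpiry);
        System.out.println("window with two masters (ms): " + (activeSeesExpiry - backupSeesDelete));
    }
}
```

Both clients see the same event in the same order; only the time at which each follower delivers it differs.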
Typically what happens is that the follower will fall out of the
quorum before the session has a chance to expire, at which point the
client will get disconnected from the follower immediately (follower
out of quorum closes all client connections until it's able to
> I tried looking at the code, but from my current understanding, we don't have logic that enforces an upper bound on how far a particular follower can lag behind (in terms of data tree processing). This means some part of the system may see the lock as released before the previous owner releases it.
see org.apache.zookeeper.server.quorum.LearnerHandler.synced() called
There is no guarantee that all clients see the events at the same
time. Only that they see them in the same order. There's always a
possibility of a race where the client on F2 sees the znode removed
before the client on F1. This effect is magnified in a cross-DC
scenario. Also consider that there is lag in server/client communication.
Have you looked at ZooKeeper.sync? This ensures that the follower is
up to date with the leader (at the time sync is processed). This may
or may not allow you to resolve the problem for this particular use
case though... (the syncLimit vs timeout being the key issue)
> Another issue that I saw in this case is that the client maintains an internal clock for when its session should expire, based on its connectivity with the follower. However, the leader's internal clock (session tracker) uses information that gets relayed from the follower via touchTable. As a result, both parties may decide differently when the session is expired if there are network issues between follower and leader.
The client only tracks when it should disconnect from the server; this
is not involved with session expiration per se. The Leader tracks
session expiration relative to the last time it heard a heartbeat from
the client (max gap being the session timeout).
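Roughly, the leader-side rule looks like the sketch below. ToySessionTracker is a made-up, simplified model for illustration; ZooKeeper's actual session tracker is more involved.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of leader-side expiration; hypothetical class, not ZooKeeper code.
public class ToySessionTracker {
    private final Map<Long, Long> expiryBySession = new HashMap<>();
    private final long sessionTimeoutMillis;

    public ToySessionTracker(long sessionTimeoutMillis) {
        this.sessionTimeoutMillis = sessionTimeoutMillis;
    }

    // Called when a heartbeat for the session reaches the leader
    // (relayed by the follower the client is connected to).
    public void touch(long sessionId, long nowMillis) {
        expiryBySession.put(sessionId, nowMillis + sessionTimeoutMillis);
    }

    // A session is expired once the timeout has elapsed since the last
    // heartbeat the leader heard; unknown sessions count as expired.
    public boolean isExpired(long sessionId, long nowMillis) {
        Long expiry = expiryBySession.get(sessionId);
        return expiry == null || nowMillis >= expiry;
    }

    public static void main(String[] args) {
        ToySessionTracker tracker = new ToySessionTracker(5_000);
        tracker.touch(1L, 0L); // last heartbeat the leader heard, at t = 0

        // The client may keep heartbeating to its follower, but if nothing
        // reaches the leader, the leader's clock is the one that decides.
        System.out.println("expired at 4 s: " + tracker.isExpired(1L, 4_000L));
        System.out.println("expired at 5 s: " + tracker.isExpired(1L, 5_000L));
    }
}
```

The point of the sketch is that only heartbeats that reach the leader extend the session, regardless of what the client and its follower exchange locally.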