We have a ZooKeeper ensemble that spans multiple data centers (each participant is in a different data center). Recently, we ran into an issue when trying to support a low session timeout (5 seconds). We set tickTime to 2 seconds and syncLimit to 25.
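To put those settings side by side, here is a rough sketch of the arithmetic. It assumes the stock defaults for minSessionTimeout (2 x tickTime) and maxSessionTimeout (20 x tickTime) and that syncLimit is counted in ticks; an internal build may override any of these, and the helper name negotiated_timeout is made up for illustration.

```python
# Values taken from the message above.
TICK_TIME_MS = 2000
SYNC_LIMIT_TICKS = 25
REQUESTED_SESSION_TIMEOUT_MS = 5000

# Assumption: stock defaults, used when minSessionTimeout and
# maxSessionTimeout are not set explicitly in the server config.
min_session_timeout_ms = 2 * TICK_TIME_MS
max_session_timeout_ms = 20 * TICK_TIME_MS


def negotiated_timeout(requested_ms, min_ms, max_ms):
    """Clamp the client's requested timeout into the server's allowed range."""
    return max(min_ms, min(requested_ms, max_ms))


session_timeout_ms = negotiated_timeout(
    REQUESTED_SESSION_TIMEOUT_MS, min_session_timeout_ms, max_session_timeout_ms
)

# Roughly how long a follower may go without syncing with the leader
# before the leader gives up on it.
follower_sync_window_ms = SYNC_LIMIT_TICKS * TICK_TIME_MS

print("session timeout (ms):", session_timeout_ms)
print("follower sync window (ms):", follower_sync_window_ms)
print("ratio:", follower_sync_window_ms / session_timeout_ms)
```

Under these assumptions the follower sync window is ten times the session timeout, so a follower can stay in the quorum while lagging by far more than the lifetime of one session.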
The use case is a single master: we can have only one master at any given time. The active master creates an ephemeral node. The backup master watches this ephemeral node and waits for it to be deleted before taking over the master role.
The active master connects to the follower (F1) in its data center. We believe that a network delay between F1 and the leader caused the touchTable not to propagate in a timely manner, so the leader decided to close the session due to timeout. The ephemeral node delete event reached the other follower (F2) before the close-session event reached F1. The backup master, which is connected to F2, received the ephemeral node delete and assumed the role of active master.
From our logs, the active master saw the session expiration event 14 seconds after the backup master received the ephemeral node delete event.
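The sequence can be laid out as a small timeline to make the overlap explicit. This is only an illustration: the absolute timestamps are invented, and the only number taken from the logs is the 14-second gap between the two observations.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float        # seconds, relative to an arbitrary origin (hypothetical)
    observer: str   # which master saw the event, and through which follower
    what: str


# Hypothetical timeline; only the 14 s difference comes from the logs.
events = [
    Event(0.0, "backup master (via F2)", "ephemeral node deleted, takes over"),
    Event(14.0, "active master (via F1)", "session expired, steps down"),
]


def dual_master_window(takeover_t, step_down_t):
    """Seconds during which both processes believe they are the master."""
    return max(0.0, step_down_t - takeover_t)


window = dual_master_window(events[0].t, events[1].t)

for e in events:
    print(f"t={e.t:5.1f}s  {e.observer}: {e.what}")
print(f"both masters active for {window} s")
```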
I tried to look at the code, and from my current understanding we don't have logic that enforces an upper bound on how far a particular follower can lag behind (in terms of data tree processing). This means some parts of the system may see the lock as released before the previous owner has released it.
Another issue I saw in this case is that the client maintains its own internal clock for when its session should expire, based on its connectivity with the follower. However, the leader's internal clock (the session tracker) uses information relayed from the follower via the touchTable. As a result, the two parties may disagree on when the session expires if there are network issues between the follower and the leader.
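A toy model of the two clocks shows how they can drift apart. It is a deliberate simplification and not the actual session tracker code: the heartbeat interval, the relay delay, and the stall window are all hypothetical, and the leader is assumed to push its deadline out by one timeout each time a relayed touch arrives.

```python
TIMEOUT_MS = 5000        # session timeout from the message
HEARTBEAT_MS = 1000      # hypothetical client -> F1 ping interval
RELAY_MS = 100           # hypothetical normal F1 -> leader relay delay
STALL = (4000, 12000)    # hypothetical window in which F1 -> leader traffic is held up


def relay_arrival(sent_ms):
    """When a touch recorded on F1 at sent_ms becomes visible to the leader."""
    if STALL[0] <= sent_ms < STALL[1]:
        return STALL[1] + RELAY_MS   # held until the network recovers
    return sent_ms + RELAY_MS


def client_sees_timeout(heartbeats, timeout_ms):
    """Client view: expired only if a gap between heartbeats exceeds the timeout."""
    return any(b - a > timeout_ms for a, b in zip(heartbeats, heartbeats[1:]))


def leader_expiry(arrivals, timeout_ms, start_ms=0):
    """Leader view: each relayed touch extends the deadline by timeout_ms.

    Returns the time at which the leader expires the session, or None if
    every touch arrived before the current deadline.
    """
    deadline = start_ms + timeout_ms
    for arrival in sorted(arrivals):
        if arrival > deadline:
            return deadline
        deadline = arrival + timeout_ms
    return None


heartbeats = list(range(0, 20001, HEARTBEAT_MS))
arrivals = [relay_arrival(t) for t in heartbeats]

client_expired = client_sees_timeout(heartbeats, TIMEOUT_MS)
leader_expired_at = leader_expiry(arrivals, TIMEOUT_MS)

print("client thinks session expired:", client_expired)
print("leader expired session at (ms):", leader_expired_at)
```

With these made-up numbers the client never misses a heartbeat to F1, yet the leader expires the session at 8100 ms because the relayed touches stop arriving during the stall.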
Our internal ZooKeeper is based on 3.4.3.
Patrick Hunt 2012-06-08, 22:11