> * First, how do clients know where to send the request? Are they monitoring
> some ephemeral znodes in ZooKeeper? If yes, then after session timeout some
> other server should notice the ephemeral znode disappearing and recreate
> another ephemeral znode. Clients should now start sending requests to the new
> server; basically, the server which is disconnected from ZK permanently
> will never get any new requests.
The client looks up the owner of the partition before making a
request. I could have the client monitor ZK and cancel the request if
the owner changes. Alternatively, I could have the client's requests
time out after a period and then recheck the ownership.
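
To make the second option concrete, here is a minimal sketch of "time out,
then recheck ownership". It does not talk to ZooKeeper at all: `lookup_owner`
and `send` are hypothetical callables standing in for "read the owner from the
partition's znode" and "issue the request with a timeout", and the default
timeout and attempt count are arbitrary assumptions.

```python
import time
from typing import Callable, Optional


def send_with_recheck(
    partition: str,
    lookup_owner: Callable[[str], Optional[str]],   # hypothetical: reads the owner from ZK
    send: Callable[[str, str, float], str],         # hypothetical: raises TimeoutError on no reply
    request_timeout: float = 2.0,                   # assumed value, tune per deployment
    max_attempts: int = 3,                          # assumed value
    retry_delay: float = 0.0,
) -> str:
    """Send to the current owner; after each timeout, look the owner up again."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        owner = lookup_owner(partition)
        if owner is None:
            # Nobody owns the partition right now (e.g. the old znode is gone
            # and no other server has claimed it yet).
            last_error = LookupError(f"no owner registered for {partition}")
        else:
            try:
                return send(owner, partition, request_timeout)
            except TimeoutError as exc:
                # The owner may be partitioned away; the next pass re-reads ownership.
                last_error = exc
        if retry_delay > 0:
            time.sleep(retry_delay)
    raise RuntimeError(
        f"gave up on {partition} after {max_attempts} attempts"
    ) from last_error
```

The client never hangs longer than roughly `max_attempts * request_timeout`,
at the cost of possibly retrying a request the old owner already processed, so
this only fits requests that are safe to repeat.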
> * Second, how much tolerance do you have for downtime? If you are OK with
> no server serving a partition for X seconds (session timeout), then on
> SyncDisconnected you can stop accepting requests and resume on
The server will never receive SessionExpired though, so my server
needs to be aware of the session timeout. This is actually what we're
doing: waiting X seconds for SyncConnected and, if it doesn't come,
shutting down. This feels kludgy, though.
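
For reference, the "wait X seconds, then shut down" logic described above can
be sketched roughly as follows. This is not our actual code and it uses no
ZooKeeper client: the state names are plain strings following the wording in
this thread, `DisconnectGate` is a made-up name, and the clock is injectable
only so the behaviour can be checked without sleeping.

```python
import time
from typing import Callable, Optional


class DisconnectGate:
    """Decides whether the server may serve, based on connection events."""

    def __init__(self, grace_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._grace = grace_seconds          # the "X seconds" from the thread
        self._clock = clock
        self._disconnected_at: Optional[float] = None   # None while connected
        self._expired = False

    def on_state(self, state: str) -> None:
        """Feed in connection events (string labels are an assumption)."""
        if state == "SyncDisconnected":
            if self._disconnected_at is None:
                self._disconnected_at = self._clock()
        elif state == "SyncConnected":
            # Only resume if the reconnect arrived within the grace period;
            # a late reconnect leaves the gate permanently closed.
            if not self.must_shut_down():
                self._disconnected_at = None
        elif state == "SessionExpired":
            self._expired = True

    def must_shut_down(self) -> bool:
        """True once the session expired or the grace period ran out."""
        if self._expired:
            return True
        if self._disconnected_at is None:
            return False
        return self._clock() - self._disconnected_at >= self._grace

    def can_serve(self) -> bool:
        """Requests are accepted only while connected."""
        return self._disconnected_at is None and not self.must_shut_down()
```

The request path checks `can_serve()` before doing any work, and a periodic
task checks `must_shut_down()`. The idea is to pick `grace_seconds` no longer
than the session timeout, so that the server stops on its own before another
server can take over the partition.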
> > SyncDisconnected can occur for a variety of reasons. It's in the class of
> > recoverable errors. Your app needs to go into a waiting state until
> > SyncConnected is received again or SessionExpired. Have you read
> > http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling ?
But how long to wait? If the server is truly partitioned from ZK, then
I'll wait forever, and the client request will hang forever.
> > You should consider using one of the high level ZooKeeper frameworks (such
> > as Curator which I wrote).
Conceptually the issue would still exist though. A high level library
doesn't solve the problem if the problem can't be solved with raw