A couple of questions:
* First, how do clients know where to send requests? Are they monitoring
some ephemeral znodes in ZooKeeper? If yes, then after the session timeout some
other server should notice the ephemeral znode disappearing and create
another ephemeral znode. Clients should now start sending requests to the new
server; basically, the server that is permanently disconnected from zk
will never get any new requests.
* Second, how much tolerance do you have for downtime? If you are ok with
no server serving a partition for X seconds (the session timeout), then on
SyncDisconnected you can stop accepting requests and resume on
SyncConnected.
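To make the stop/resume idea concrete, here is a rough sketch of a request gate. It does not use a real ZooKeeper client: ZkState, RequestGate and should_give_up are hypothetical names, and the enum values only mirror the event names used in this thread. The "give up after the session timeout" check is my own assumption about how a server could avoid waiting forever; it is not something the ZooKeeper client does for you.

```python
import enum
import time


class ZkState(enum.Enum):
    # Hypothetical stand-ins for the connection events named in this thread;
    # a real client library delivers its own event types.
    SYNC_CONNECTED = "SyncConnected"
    SYNC_DISCONNECTED = "SyncDisconnected"
    SESSION_EXPIRED = "SessionExpired"


class RequestGate:
    """Tracks whether this server may accept requests for its partition."""

    def __init__(self, session_timeout_s, clock=time.monotonic):
        self._timeout = session_timeout_s
        self._clock = clock              # injectable so tests can fake time
        self._disconnected_at = None     # None means "currently connected"
        self._expired = False

    def on_state(self, state):
        if state is ZkState.SESSION_EXPIRED:
            # Ownership is gone for good; never reopen the gate.
            self._expired = True
        elif state is ZkState.SYNC_DISCONNECTED:
            # Remember only the first disconnect of an outage.
            if self._disconnected_at is None:
                self._disconnected_at = self._clock()
        elif state is ZkState.SYNC_CONNECTED:
            self._disconnected_at = None

    def may_serve(self):
        return not self._expired and self._disconnected_at is None

    def should_give_up(self):
        # Assumption: once we have been disconnected for a full session
        # timeout, treat the session as lost even without SessionExpired.
        if self._expired:
            return True
        if self._disconnected_at is None:
            return False
        return self._clock() - self._disconnected_at >= self._timeout
```

The request handler would check may_serve() before touching the partition, and a periodic task would check should_give_up() to decide when to shut the server down.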
There are a couple of other approaches to detecting a network partition: (1) try to
communicate with other nodes (some kind of gossip protocol), see if they
can reach zk, and based on the response put the server in a limbo state, or (2)
have a central coordinator and check with the coordinator whether it should
continue processing. Neither of these is trivial to implement.
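As a very rough illustration of approach (1), the decision step could look something like the sketch below. Everything here is hypothetical (Decision, decide_when_disconnected, the shape of peer_reports), the gossip transport itself is left out, and the policy is just one conservative choice, not a proven-safe protocol.

```python
import enum


class Decision(enum.Enum):
    KEEP_SERVING = "keep_serving"
    LIMBO = "limbo"


def decide_when_disconnected(peer_reports):
    """Decide what a server that has lost its zk connection should do.

    peer_reports maps a peer name to:
      True  - the peer answered and says it can reach zk
      False - the peer answered and says it cannot reach zk
      None  - the peer did not answer at all
    """
    if not peer_reports:
        # No peers to ask: we cannot tell anything, so stop serving.
        return Decision.LIMBO
    if any(report is None for report in peer_reports.values()):
        # An unreachable peer may still see zk and take over our partition.
        return Decision.LIMBO
    if any(report is True for report in peer_reports.values()):
        # zk is up for someone else, so we are the ones cut off.
        return Decision.LIMBO
    # Every peer answered and none can reach zk: this looks like a zk
    # outage, so (by assumption) nobody can take ownership away from us.
    return Decision.KEEP_SERVING
```

For example, decide_when_disconnected({"a": False, "b": False}) keeps serving, while any True or missing answer puts the server in limbo. Peers on the same side of a partition can still mislead you, which is part of why this is not trivial.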
On Wed, Mar 13, 2013 at 2:21 PM, Jordan Zimmerman <
[EMAIL PROTECTED]> wrote:
> SyncDisconnected can occur for a variety of reasons. It's in the class of
> recoverable errors. Your app needs to go into a waiting state until
> SyncConnected is received again or SessionExpired. Have you read
> http://wiki.apache.org/hadoop/ZooKeeper/ErrorHandling ?
> You should consider using one of the high level ZooKeeper frameworks (such
> as Curator which I wrote).
> On Mar 13, 2013, at 2:01 PM, Ivan Kelly <[EMAIL PROTECTED]> wrote:
> > Hi guys,
> > We have a use case here where ZooKeeper is used to coordinate ownership
> > of partitions of a resource. When one server dies, the partition
> > should be moved to another server, etc. The action we need to take on
> > SessionExpired is very clear. We just kill the server.
> > However it is unclear what we should do on a SyncDisconnected. We
> > can't just kill our server, as it may have just been one zookeeper
> > server failing. If we block all client requests to our server while we
> > wait for SyncConnected, we may block forever in the case that our
> > server is partitioned away from the zk cluster. If we continue to
> > serve requests, we risk split brain.
> > What have people done in the past to resolve issues like this?
> > -Ivan
> >  This is a risk anyhow without proper fencing, but a limited amount
> > is ok in our application.