-Discussion on supporting a large number of clients for a zk ensemble
I wanted to start a discussion on how we can support a large number of clients in zookeeper. I am at facebook and we are using zookeeper for quite a few projects. There are a couple of projects where we are designing for a large number of clients. The projects are
1. Building a directory service for holding configuration information (lookup table for which node to go to for a given key).
2. For HDFS clients, where clients lookup zookeeper for the current namenode
This information changes infrequently and is small, so update rate or size of data is not an issue.
The key challenge is to support that large a number of clients (30K to start with, but eventually could be 100K). A big chunk of the clients can try to connect/disconnect at the same time - so herd effect can happen.
I was trying out a 3 node ensemble. I noticed that with about 20K clients, there we quite a few session expires and disconnects.
I looked through the code briefly and since all the pings are eventually handled by the leader, my guess is that the leader thread is not keeping up. I haven't yet do the instrumentation/tracing to validate this.
I have been thinking about how to improve this and thought of the following solution. I am trying to hit 2 goals with this.
1. Make it possible to have a very large number of clients (each client has a watch) without losing connections too often.
2. Improve how quickly a large number of clients can connect.
1. The idea is to introduce a new type of session - "local" session. A "local" session doesn't have a full functionality of a normal session.
2. Local sessions cannot create ephemeral nodes.
3. Once a local session is lost, you cannot re-establish it using the session-id/password. The session and its watches are gone for good.
4. When a local session connects, the session info is only maintained on the zookeeper server that it is connected to. The leader is not aware of the creation of such a session and there is no state written to disk.
5. The pings and expiration is handled by the server that the session is connected to.
With the above changes, it should be easy to scale ZK by adding more learners, which manage the "local" sessions independently. Also, the rate at which you can establish "local" sessions, would be significantly higher than the normal sessions.
Would like to stir up a discussion on whether this is the best way to achieve these goals or if I am missing simpler ways of accomplishing this.