We've been seeing a problem with our ZooKeeper servers lately, where
all of a sudden a session loses some of the watchers registered on
some of the znodes. Let me explain our Kafka-ZK setup. We have a Kafka
cluster in one DC that establishes sessions (with a 6-second timeout)
with a ZK cluster (of 4 machines) in another DC and registers watchers
on some ZooKeeper paths. Every couple of weeks, we observe a problem
with the Kafka servers and, on investigating further, find that the
session has lost some of its key watches, but not all of them.
The last time this happened, we ran the wchc command on the ZK servers
and saw the problem. Unfortunately, we lost relevant information from
the ZK logs by the time we were ready to debug it further. Since this
causes Kafka servers to stop making progress, we want to set up some
kind of alert when this happens. This will help us collect more
information to give you. In particular, we were thinking about running
wchp periodically (maybe once a minute), grepping for the relevant ZK
paths and checking that each one has the number of watches required
for correct operation. But I observed that the watcher info is not
replicated across all ZK servers, so we would have to query every ZK
server in order to get the full list.
I'm not sure running wchp periodically on all ZK servers is the best
option for this alert. Can you think of what could be the problem here
and how we can set up this alert for now?
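
For illustration, the periodic check described above could look roughly
like the Python sketch below. It is only a sketch under assumptions: it
assumes wchp prints each znode path on its own line followed by one
indented session ID per watcher, the host names and expected watch
counts are placeholders, and the helper names (four_letter_word,
parse_wchp, merge_watches, missing_watches, check_cluster) are
hypothetical. It queries every server and merges the results because the
watcher info is not replicated across servers.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_wchp(text):
    """Parse wchp output into {path: set of session IDs}.

    Assumed format: a path line, then one indented session ID per line.
    """
    watches = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0] in " \t":
            # Indented line: a session ID watching the current path.
            if current is not None:
                watches[current].add(line.strip())
        else:
            # Unindented line: start of a new path entry.
            current = line.strip()
            watches.setdefault(current, set())
    return watches


def merge_watches(per_server):
    """Union the per-server {path: sessions} maps into one cluster-wide map."""
    merged = {}
    for watches in per_server:
        for path, sessions in watches.items():
            merged.setdefault(path, set()).update(sessions)
    return merged


def missing_watches(watches, expected):
    """Return {path: (found, wanted)} for paths with too few watchers."""
    short = {}
    for path, wanted in expected.items():
        found = len(watches.get(path, ()))
        if found < wanted:
            short[path] = (found, wanted)
    return short


def check_cluster(servers, expected):
    """Query every server; return (paths short of watchers, unreachable servers)."""
    per_server, unreachable = [], []
    for host, port in servers:
        try:
            per_server.append(parse_wchp(four_letter_word(host, port, "wchp")))
        except OSError:
            unreachable.append((host, port))
    return missing_watches(merge_watches(per_server), expected), unreachable
```

Calling check_cluster([("zk1.example.com", 2181), ...], {"/controller": 1})
once a minute from cron and alerting when either returned value is
non-empty would approximate the alert. The ZooKeeper admin docs warn
that wchp can be expensive on servers with many watches, so the
interval may need tuning.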