I've run into the following issue with the Kafka server. The zkclient lib seems to die silently if there is an UnknownHostException (or any IOException) while reconnecting the ZK session. I've filed a bug about this with the zkclient lib (https://github.com/sgroschupf/zkclient/issues/23). The ramification for Kafka was the silent loss of all ephemeral nodes associated with the affected process.
Has anyone faced this issue? If so, what is the recommended way of dealing with this?
If there is no good solution available, would the community be open to a patch that periodically verifies ZK connectivity?
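[Editor's note: for concreteness, such a periodic verification could look roughly like the sketch below. `ZkWatchdog`, its probe, and its reconnect hook are all hypothetical names, not real zkclient API; the probe would typically be something cheap like an `exists()` call, and the reconnect hook whatever re-establishes the session in your setup.]

```java
import java.util.concurrent.*;
import java.util.function.BooleanSupplier;

// Sketch of a ZK-connectivity watchdog (illustrative, not part of zkclient):
// probe the connection periodically and invoke a reconnect callback after
// `maxFailures` consecutive failed probes.
public class ZkWatchdog {
    private final BooleanSupplier isConnected; // health probe, e.g. a cheap exists() call
    private final Runnable reconnect;          // hook that re-establishes the session
    private final int maxFailures;
    private int consecutiveFailures = 0;

    public ZkWatchdog(BooleanSupplier isConnected, Runnable reconnect, int maxFailures) {
        this.isConnected = isConnected;
        this.reconnect = reconnect;
        this.maxFailures = maxFailures;
    }

    // One probe; exposed separately so a scheduler (or a test) can drive it.
    public void checkOnce() {
        if (isConnected.getAsBoolean()) {
            consecutiveFailures = 0;
        } else if (++consecutiveFailures >= maxFailures) {
            consecutiveFailures = 0;
            reconnect.run();
        }
    }

    // Run the probe on a fixed schedule.
    public ScheduledFuture<?> start(ScheduledExecutorService scheduler, long periodSeconds) {
        return scheduler.scheduleAtFixedRate(this::checkOnce, periodSeconds,
                                             periodSeconds, TimeUnit.SECONDS);
    }
}
```

Requiring several consecutive failures before reconnecting avoids cycling the session on a single transient blip.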
Interesting. I haven't had the chance to dive into the zkclient codebase to understand the root cause yet, but since you mentioned this can cause ephemeral node loss, I am curious to know how you detected it. Did the Kafka consumer not respond to rebalance events, or did the server not respond to state change events? Also, ephemeral nodes are lost only when sessions expire on the ZooKeeper server or when clients actively close the session, so how does losing the connection lead to ephemeral node loss?
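[Editor's note: the distinction above can be made concrete with a toy model, plain Java and deliberately not the real ZooKeeper API. Ephemeral nodes are owned by a *session*, not a connection, so a dropped connection alone leaves them in place; only session expiry (or an explicit close) removes them.]

```java
import java.util.*;

// Toy model of ZooKeeper's ephemeral-node semantics (illustrative only):
// ephemeral nodes are tied to a session, not to a TCP connection.
class ToyZk {
    private final Map<Long, Set<String>> ephemeralsBySession = new HashMap<>();
    private final Set<String> nodes = new HashSet<>();

    void createEphemeral(long sessionId, String path) {
        nodes.add(path);
        ephemeralsBySession.computeIfAbsent(sessionId, k -> new HashSet<>()).add(path);
    }

    // A dropped connection changes nothing on the server side; the session
    // (and its ephemerals) survives until the session timeout elapses.
    void connectionLost(long sessionId) { /* no-op: nodes remain */ }

    // Only session expiry (or an explicit close) deletes the ephemerals.
    void sessionExpired(long sessionId) {
        Set<String> owned = ephemeralsBySession.remove(sessionId);
        if (owned != null) nodes.removeAll(owned);
    }

    boolean exists(String path) { return nodes.contains(path); }
}

public class EphemeralDemo {
    public static void main(String[] args) {
        ToyZk zk = new ToyZk();
        zk.createEphemeral(42L, "/brokers/ids/0");
        zk.connectionLost(42L);
        System.out.println(zk.exists("/brokers/ids/0")); // true: registration survives
        zk.sessionExpired(42L);
        System.out.println(zk.exists("/brokers/ids/0")); // false: node gone
    }
}
```

This is why a client that silently fails to re-establish an expired session, as in the zkclient bug above, ends up with all of its registrations gone.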
Thanks,
Neha

On Mon, Sep 23, 2013 at 7:02 AM, Anatoly Fayngelerin <[EMAIL PROTECTED]> wrote:
For what it is worth, I am currently looking into a problem that sounds suspiciously related. We're seeing NoNodeException errors for the consumer node during rebalance:
Caused by: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /consumers/es_consumer/ids/es_consumer_cloudalytics-preprod-app3.s1phx1.jivehosted.com-1379900787620-f067bcb7
(this is for the local consumer node) and, looking at the code, I was having a hard time figuring out how this happened. Sadly, I didn't keep the logs for the period when the problem started, so I've been stumped so far...
On Sep 24, 2013, at 5:43 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
Joel - that is exactly right. ZkClient has no way to notify its users of this situation: the session end event gets fired, but the session begin event never follows.
Neha - The issue manifested itself when producers were attempting to discover topics/brokers. The Kafka brokers had lost their ZK sessions during a network outage. The outage was long enough for ZooKeeper to expire the sessions corresponding to the ephemeral nodes in /broker/. The zkclient bug prevented the broker from ever re-establishing the ZK session. Subsequently, no ZooKeeper-based producer was able to discover topic->broker mappings. The resulting exceptions looked like:
Caused by: kafka.common.NoBrokersForPartitionException: Partition = null
    at kafka.producer.Producer.kafka$producer$Producer$getPartitionListForTopic(Producer.scala:167)
    at kafka.producer.Producer$anonfun$3.apply(Producer.scala:116)
    at kafka.producer.Producer$anonfun$3.apply(Producer.scala:105)
    at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
    at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
    at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
    at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:33)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
    at scala.collection.mutable.WrappedArray.map(WrappedArray.scala:33)
    at kafka.producer.Producer.zkSend(Producer.scala:105)
    at kafka.producer.Producer.send(Producer.scala:99)
    at com.yieldmo.common.protobuf.ProtoKafkaWriter$class.write(ProtoKafka.scala:20)
    at com.yieldmo.common.protobuf.ProtoWriter.write(ProtoKafka.scala:40)
    at com.yieldmo.storm.bolt.KafkaProtoWriterBolt.execute(KafkaProtoWriterBolt.scala:48)
As far as I can see, the only way to deal with this without patching zkclient is to periodically check the status of the ZK connection and try to detect this kind of situation. I would love to hear better ideas for how to handle this.

On Tue, Sep 24, 2013 at 3:31 AM, Joel Koshy <[EMAIL PROTECTED]> wrote:
Thanks for explaining the bug. This is a serious issue that we should fix at the zkclient level. We have submitted patches to them before, and they were pretty helpful in releasing a new version with the patch. I think that will lead to a cleaner solution than trying to work around it in Kafka code, since zkclient usage is pretty widespread across the server and consumer code today.
Thanks,
Neha

On Tue, Sep 24, 2013 at 8:28 AM, Anatoly Fayngelerin <[EMAIL PROTECTED]> wrote:
That does sound like a saner solution. Which GitHub repo do you submit patches to? It looks like the repo I originally posted on (https://github.com/sgroschupf/zkclient/issues/23) might be a little stale.

On Tue, Sep 24, 2013 at 11:34 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote: