Kafka >> mail # dev >> zkclient dies after UnknownHostException in zk reconnect
Re: zkclient dies after UnknownHostException in zk reconnect
Thanks for explaining the bug. This is a serious issue that we should fix
at the zkclient level. We have submitted patches to them before and they
were pretty helpful in releasing a new version with the patch. I think that
will lead to a cleaner solution than trying to get around it in Kafka code
since zkclient usage is pretty widespread across the server and consumer
code today.

Thanks,
Neha
On Tue, Sep 24, 2013 at 8:28 AM, Anatoly Fayngelerin <[EMAIL PROTECTED]> wrote:

> Joel - that is exactly right. ZkClient has no way to notify consumers of
> this situation. The session end event gets fired; however, the session
> begin event never occurs.
>
> Neha - The issue manifested itself when producers were attempting to
> discover topics/brokers. The kafka brokers had lost their ZK sessions
> during a network outage. The outage was long enough for ZooKeeper to expire
> the sessions corresponding to the ephemeral nodes in /broker/. The zkclient
> bug prevented the broker from ever re-establishing the ZK session.
> Subsequently, no zookeeper based producer was able to discover
> topic->broker mappings. The resulting exceptions looked like:
>
> Caused by: kafka.common.NoBrokersForPartitionException: Partition = null
>     at kafka.producer.Producer.kafka$producer$Producer$getPartitionListForTopic(Producer.scala:167)
>     at kafka.producer.Producer$anonfun$3.apply(Producer.scala:116)
>     at kafka.producer.Producer$anonfun$3.apply(Producer.scala:105)
>     at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
>     at scala.collection.TraversableLike$anonfun$map$1.apply(TraversableLike.scala:233)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:34)
>     at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:33)
>     at scala.collection.TraversableLike$class.map(TraversableLike.scala:233)
>     at scala.collection.mutable.WrappedArray.map(WrappedArray.scala:33)
>     at kafka.producer.Producer.zkSend(Producer.scala:105)
>     at kafka.producer.Producer.send(Producer.scala:99)
>     at com.yieldmo.common.protobuf.ProtoKafkaWriter$class.write(ProtoKafka.scala:20)
>     at com.yieldmo.common.protobuf.ProtoWriter.write(ProtoKafka.scala:40)
>     at com.yieldmo.storm.bolt.KafkaProtoWriterBolt.execute(KafkaProtoWriterBolt.scala:48)
>
> As far as I can see, the only way to deal with this without patching
> zkclient is to periodically check the status of the zk connection and try
> to detect this kind of situation. I would love to hear better ideas for how
> to handle this.
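
[Editor's sketch] The periodic check described above could look roughly like the
following. This is only a minimal illustration under stated assumptions: the
class name ZkSessionWatchdog, the isConnected probe and the onStuck recovery
action are all hypothetical and are not part of zkclient or Kafka. The probe
stands in for however the application chooses to test its ZooKeeper connection,
and the recovery action stands in for rebuilding the client and re-registering
ephemeral nodes.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

/** Hypothetical watchdog; none of these names come from zkclient or Kafka. */
final class ZkSessionWatchdog {
    private final BooleanSupplier isConnected; // assumed probe of the ZK connection
    private final Runnable onStuck;            // assumed recovery action
    private final int maxFailedChecks;
    private final AtomicInteger failedChecks = new AtomicInteger();

    ZkSessionWatchdog(BooleanSupplier isConnected, Runnable onStuck, int maxFailedChecks) {
        if (maxFailedChecks < 1) {
            throw new IllegalArgumentException("maxFailedChecks must be >= 1");
        }
        this.isConnected = isConnected;
        this.onStuck = onStuck;
        this.maxFailedChecks = maxFailedChecks;
    }

    /** Runs one check; returns true if the recovery action was triggered. */
    boolean checkOnce() {
        if (isConnected.getAsBoolean()) {
            failedChecks.set(0); // healthy again, forget earlier failures
            return false;
        }
        if (failedChecks.incrementAndGet() >= maxFailedChecks) {
            failedChecks.set(0); // start counting afresh after recovery
            onStuck.run();
            return true;
        }
        return false;
    }

    /** Schedules checkOnce() at a fixed rate; the caller shuts the executor down. */
    ScheduledExecutorService start(long periodMillis) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                checkOnce();
            } catch (RuntimeException e) {
                // Swallow so that one failed check does not cancel future runs.
            }
        }, periodMillis, periodMillis, TimeUnit.MILLISECONDS);
        return scheduler;
    }
}
```

Requiring several consecutive failed checks before acting is a design choice to
avoid reacting to a brief disconnect that the client would recover from on its
own; the threshold and period would need tuning against the ZK session timeout.
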
>
>
> On Tue, Sep 24, 2013 at 3:31 AM, Joel Koshy <[EMAIL PROTECTED]> wrote:
>
> > > node loss. Did the Kafka consumer not respond to rebalance events or did
> > > the server not respond to state change events? Also, ephemeral nodes are
> > > lost only when sessions are expired on the zookeeper server or if clients
> > > close the session actively. How does losing the connection lead to
> > > ephemeral node loss?
> >
> > My understanding of Anatoly's observation is that on session
> > expiration, zkclient will reconnect
> > (https://github.com/sgroschupf/zkclient/blob/master/src/main/java/org/I0Itec/zkclient/ZkClient.java#L458)
> > but if the connect causes an IOException, that would effectively mean
> > that the session will not get re-established. Anatoly, can you
> > confirm?
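
[Editor's sketch] The failure mode described above is a single reconnect attempt
that throws an IOException and is never retried. One way a client could guard
against that is to retry the session open with a pause between attempts. The
code below is a hypothetical illustration only, not zkclient's actual reconnect
code and not the patch discussed in this thread; RetryingReconnect,
SessionOpener and the parameters are invented names.

```java
import java.io.IOException;

/** Hypothetical sketch; not zkclient's actual reconnect logic. */
final class RetryingReconnect {

    /** Stand-in for "open a new ZooKeeper session"; may fail, e.g. with UnknownHostException. */
    interface SessionOpener {
        void open() throws IOException;
    }

    /**
     * Calls opener.open() until it succeeds, sleeping backoffMillis between
     * attempts. Returns the number of attempts made. After maxAttempts
     * failures the last IOException is rethrown so the caller can react
     * instead of being left without a session silently.
     */
    static int reconnect(SessionOpener opener, int maxAttempts, long backoffMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                opener.open();
                return attempt; // session re-established
            } catch (IOException e) {
                last = e; // remember the failure and try again
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last;
    }
}
```

Since java.net.UnknownHostException is a subclass of IOException, a transient
DNS failure during the outage would be retried by this loop rather than ending
the reconnect.
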
> >
> > > On Mon, Sep 23, 2013 at 7:02 AM, Anatoly Fayngelerin <[EMAIL PROTECTED]> wrote:
> > >
> > >> Hi Everyone,
> > >>
> > >> I've run into the following issue with the Kafka server. The zkclient lib
> > >> seems to die silently if there is an UnknownHostException (or any
> > >> IOException) while reconnecting the ZK session. I've filed a bug about this
> > >> with the zkclient lib (https://github.com/sgroschupf/zkclient/issues/23).
> > >> The ramifications for Kafka were the silent loss of all ephemeral nodes
> > >> associated with the affected process.
> > >>
> > >> Has anyone faced this issue? If so, what is the recommended way of