We have a small Kafka cluster (0.7.1, 3 nodes) in EC2. The load is about 200 million events per day, each a few kilobytes. We have a single-node ZooKeeper.
Yesterday suddenly our Kafka clients started throwing the following exception:

java.lang.RuntimeException: kafka.common.ConsumerRebalanceFailedException: CONSUMER_GROUP_NAME_ip-00-00-00-00.ec2.internal-1373821190828-5f78e9af can't rebalance after 4 retries
    at com.gumgum.kafka.consumer.KafkaTemplate.executeWithBatch(KafkaTemplate.java:59)
    at com.gumgum.storm.fileupload.GenericKafkaSpout.nextTuple(GenericKafkaSpout.java:73)
    at backtype.storm.daemon.executor$fn__3968$fn__4009$fn__4010.invoke(executor.clj:433)
    at backtype.storm.util$async_loop$fn__465.invoke(util.clj:377)
None of the Kafka clients (ConsumerConnector class) would start; they all failed with this exception.
We tried restarting the clients and restarting ZooKeeper as well, but it only started working again once we restarted all of our Kafka brokers. We didn't lose any data because the producers (going directly to the brokers through a load balancer) were working fine.
I tried googling this issue and it looks like a lot of people have faced it, but I couldn't find anything concrete.
Given this, I have two questions:
1. It would be nice if you could tell me why this can happen, or point me to a link where I can understand it better. What does consumer rebalancing mean? Does it mean consumers are trying to coordinate amongst themselves using ZooKeeper?
2. On a separate note, are there any JMX parameters I need to monitor to make sure that my Kafka cluster is healthy? How can I keep watch on my Kafka cluster?
Regards,
Vaibhav Puranik
GumGum
Vaibhav Puranik 2013-07-15, 17:15
ConsumerRebalanceFailedException when broker unavailable
I'm going through a series of tests with Kafka 0.8 to see what the failure cases are. One simple scenario is this:
1. Launch Zookeeper
2. Launch my consumer (written in scala) but don't launch a Kafka broker (I am not using replication in my configuration).
What happens in this particular scenario is that my consumer hits the createMessageStreams which internally makes 4 failed attempts and then it raises a ConsumerRebalanceFailedException:
kafka.common.ConsumerRebalanceFailedException: zfilter_vm-florin-ubuntu-d-12-1373675347482-ff93b841 can't rebalance after 4 retries
    at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:397)
    at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:681)
    at kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:205)
    at kafka.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:133)
    at kafka.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:126)

I was expecting a more specific exception - something like NoBrokerAvailable instead of this. I can add a catch clause for this exception and retry the operation in a loop until the broker comes online but I dislike the approach because I am making an assumption here - that ConsumerRebalanceFailedException means "no broker available". Is there a better way to do this? Do you have a pattern for consumers for when brokers go offline and come back up?
Thanks a lot!
Florin Trofin 2013-07-13, 00:56
Re: ConsumerRebalanceFailedException when broker unavailable
Yes - rebalance => consumers trying to coordinate through ZK. Rebalances can happen when one or more of the following happen:
- a consumed topic partition appears or disappears - i.e., if a broker comes or goes.
- a consumer instance in the group comes or goes

"goes" could also be triggered by session expirations in zookeeper - typically caused by client-side GC or flaky connections to zookeeper.
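To make the coordination a bit more concrete, here is a rough sketch of a range-style partition assignment, similar in spirit to what the ZooKeeper-based consumer documentation describes: every consumer sorts the partitions and the group members the same way and then picks its own contiguous slice. This is not the actual Kafka code; the class name, method name, and the "broker-partition" ids are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative only: a simplified range-style assignment, not Kafka's implementation.
final class RangeAssignmentSketch {

    /** Returns the slice of partitions that consumerId would own within its group. */
    static List<String> assign(List<String> partitions, List<String> consumers, String consumerId) {
        // Every consumer sorts both lists identically, so all members agree
        // on the result without talking to each other directly.
        List<String> sortedPartitions = new ArrayList<>(partitions);
        List<String> sortedConsumers = new ArrayList<>(consumers);
        Collections.sort(sortedPartitions);
        Collections.sort(sortedConsumers);

        int position = sortedConsumers.indexOf(consumerId);
        if (position < 0) {
            throw new IllegalArgumentException("unknown consumer: " + consumerId);
        }

        int perConsumer = sortedPartitions.size() / sortedConsumers.size();
        int withExtra = sortedPartitions.size() % sortedConsumers.size();
        // The first `withExtra` consumers each take one additional partition.
        int start = perConsumer * position + Math.min(position, withExtra);
        int count = perConsumer + (position < withExtra ? 1 : 0);
        return new ArrayList<>(sortedPartitions.subList(start, start + count));
    }

    public static void main(String[] args) {
        List<String> group = Arrays.asList("c1", "c2");

        // Hypothetical ids: two partitions on brokers 0 and 1, one on broker 2.
        List<String> all = Arrays.asList("0-0", "0-1", "1-0", "1-1", "2-0");
        System.out.println(assign(all, group, "c1")); // [0-0, 0-1, 1-0]
        System.out.println(assign(all, group, "c2")); // [1-1, 2-0]

        // Broker 2 goes away: the partition list changes, so ownership shifts.
        List<String> afterLoss = Arrays.asList("0-0", "0-1", "1-0", "1-1");
        System.out.println(assign(afterLoss, group, "c1")); // [0-0, 0-1]
        System.out.println(assign(afterLoss, group, "c2")); // [1-0, 1-1]
    }
}
```

Because ownership depends on both lists, any change in brokers or group members means every consumer has to release and re-claim its partitions, which is the step that can fail and be retried.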
On Mon, Jul 15, 2013 at 10:15 AM, Vaibhav Puranik <[EMAIL PROTECTED]> wrote:
From a user's perspective, ConsumerRebalanceFailedException is a bit cryptic. I think the other thread was about providing a more informative message and also being able to recover when a broker does come up (fixed in KAFKA-969).
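For clients that cannot pick up that fix, the catch-and-retry loop mentioned in the other thread might look roughly like the sketch below. It is only an illustration under assumptions: RebalanceFailed is a hypothetical stand-in for kafka.common.ConsumerRebalanceFailedException so the example compiles without Kafka on the classpath, and the retry helper and its parameters are invented names, not part of any Kafka API.

```java
import java.util.function.Supplier;

// Illustrative only: bounded retry with exponential backoff around a connect step.
final class RetryUntilAvailableSketch {

    /** Hypothetical stand-in for kafka.common.ConsumerRebalanceFailedException. */
    static final class RebalanceFailed extends RuntimeException {
        RebalanceFailed(String message) {
            super(message);
        }
    }

    /**
     * Calls connect until it succeeds or maxAttempts is reached, sleeping between
     * attempts and doubling the delay up to maxBackoffMs. Only RebalanceFailed is
     * retried; any other exception propagates immediately.
     */
    static <T> T retry(Supplier<T> connect, int maxAttempts, long initialBackoffMs, long maxBackoffMs) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long backoffMs = initialBackoffMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return connect.get();
            } catch (RebalanceFailed e) {
                if (attempt >= maxAttempts) {
                    throw e; // give up and surface the original failure
                }
                try {
                    Thread.sleep(backoffMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting to retry", ie);
                }
                backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
            }
        }
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        // Simulated connect step: fails twice, then succeeds.
        String result = retry(() -> {
            calls[0]++;
            if (calls[0] < 3) {
                throw new RebalanceFailed("can't rebalance after 4 retries");
            }
            return "streams";
        }, 5, 10L, 100L);
        System.out.println(result + " after " + calls[0] + " attempts"); // streams after 3 attempts
    }
}
```

In a real client the supplier would presumably shut down the failed connector and create a fresh one before calling createMessageStreams again; that part is an assumption and depends on the Kafka version in use. The bounded attempt count keeps the caller from looping forever when the failure has some other cause than a missing broker.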
On Tue, Jul 16, 2013 at 11:04 AM, Vaibhav Puranik <[EMAIL PROTECTED]> wrote:
What is the full stack trace? If you see "can't rebalance after 4 retries", then the likely problem is that the broker is down or not available.
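The "4 retries" in that message is configurable on the consumer side. A minimal sketch, assuming the 0.8 high-level consumer property names (rebalance.max.retries, rebalance.backoff.ms, zookeeper.session.timeout.ms; 0.7 uses different names, so check the configuration docs for your version). The helper names and the values are illustrative only. The check reflects the commonly suggested rule of thumb that the total retry window should be longer than the ZooKeeper session timeout, so that a departed consumer's registration has time to expire before the retries run out.

```java
import java.util.Properties;

// Illustrative only: builds consumer properties and sanity-checks the retry window.
final class RebalanceConfigSketch {

    static Properties consumerProperties(String zkConnect, String groupId,
                                         int maxRetries, long backoffMs, long zkSessionTimeoutMs) {
        Properties props = new Properties();
        // Property names assumed from the 0.8 consumer configuration.
        props.setProperty("zookeeper.connect", zkConnect);
        props.setProperty("group.id", groupId);
        props.setProperty("zookeeper.session.timeout.ms", Long.toString(zkSessionTimeoutMs));
        props.setProperty("rebalance.max.retries", Integer.toString(maxRetries));
        props.setProperty("rebalance.backoff.ms", Long.toString(backoffMs));
        return props;
    }

    /** True when retries * backoff is longer than the ZooKeeper session timeout. */
    static boolean retryWindowCoversSessionTimeout(Properties props) {
        long retries = Long.parseLong(props.getProperty("rebalance.max.retries"));
        long backoffMs = Long.parseLong(props.getProperty("rebalance.backoff.ms"));
        long sessionTimeoutMs = Long.parseLong(props.getProperty("zookeeper.session.timeout.ms"));
        return retries * backoffMs > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Hypothetical ZooKeeper address and group name.
        Properties props = consumerProperties("localhost:2181", "example-group", 4, 2000L, 6000L);
        System.out.println(retryWindowCoversSessionTimeout(props)); // true (8000 > 6000)
    }
}
```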
/*******************************************
 Joe Stein
 Founder, Principal Consultant
 Big Data Open Source Security LLC
 http://www.stealth.ly
 Twitter: @allthingshadoop <http://www.twitter.com/allthingshadoop>
********************************************/

On Fri, Nov 29, 2013 at 11:31 AM, Yu, Libo <[EMAIL PROTECTED]> wrote:
Joe Stein 2013-11-29, 16:57