The blog also mentions spurious rebalances caused by improper GC settings, but we couldn't find which GC settings to use. We are considering changing the ZooKeeper timeouts. We are a little confused about the various issues, the sequence in which they occur, and what could cause the consumers to stop reading. If the fetchers get shut down because of a ClosedByInterruptException in the "leader_finder" thread, which tells the "executor_watcher" thread to shut down the fetchers, that would be another reason the consumers stop processing data. Is this possible?
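For context, the ZooKeeper timeouts in question are ordinary consumer properties in the 0.8 high-level consumer. A minimal sketch, assuming the standard ConsumerConfig keys and a hypothetical ZooKeeper ensemble (the values are illustrative, not tuning recommendations):

    import java.util.Properties;

    public class ConsumerTimeoutSketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("group.id", "account-activated-hadoop-consumer");
            // Hypothetical ensemble; replace with your own quorum.
            props.put("zookeeper.connect", "zk1:2181,zk2:2181,zk3:2181");
            // How long ZK waits before expiring this client's session.
            // Raising it masks long GC pauses but slows failure detection.
            props.put("zookeeper.session.timeout.ms", "6000");
            // How long to wait when first connecting to ZK.
            props.put("zookeeper.connection.timeout.ms", "6000");
            System.out.println(props);
        }
    }

Note that raising the session timeout only hides the symptom; as the replies below point out, the root cause of the expirations still needs fixing.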
If a consumer rebalances for any reason (e.g., if a consumer in the group has a soft failure such as a long GC) then the fetchers are stopped as part of the rebalance process. The sequence is as follows:
- Stop fetchers
- Commit offsets
- Release partition ownership
- Rebalance (i.e., figure out what partitions this consumer should now consume with the updated set of consumers)
- Acquire partition ownership
- Add fetchers to those partitions and resume consumption
i.e., rebalances should complete successfully and fetching should resume. If you have any rebalance failures (search for "can't rebalance after") then the consumer will effectively stop.
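For what it's worth, the "can't rebalance after" message appears when the high-level consumer exhausts its rebalance retry budget, which is itself configurable. A hedged sketch using the 0.8 consumer property names (values are examples only):

    import java.util.Properties;

    public class RebalanceRetrySketch {
        public static void main(String[] args) {
            Properties props = new Properties();
            // Attempts before giving up with "can't rebalance after N retries".
            props.put("rebalance.max.retries", "4");
            // Pause between rebalance attempts, in milliseconds.
            props.put("rebalance.backoff.ms", "2000");
            System.out.println(props);
        }
    }

More retries buy time for a flapping consumer to settle, but they don't address whatever is triggering the rebalances in the first place.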
From later in this thread it seems your consumer somehow got into a weird state in zookeeper, so your only recourse at this point may be to stop all your consumers and restart.
From the logs it seems the consumer 562b6738's registry node in Zookeeper has been lost:
NoNode for /consumers/account-activated-hadoop-consumer/ids/account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738
As Joel suggested, for now you may just stop all your consumers and restart. To debug, you may need to check ZooKeeper's logs for any session expiration or socket-close events that would cause ZK to delete the registry node.
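If you want to confirm whether the registry node is still missing, you can check it directly with the ZooKeeper client. A minimal sketch, assuming a placeholder connect string and using the path from the NoNode error above:

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class CheckRegistryNode {
        public static void main(String[] args) throws Exception {
            // Placeholder connect string; point this at your ensemble.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* ignore events */ }
            });
            String path = "/consumers/account-activated-hadoop-consumer/ids/"
                + "account-activated-hadoop-consumer_tm1mwdpl04-1389222557906-562b6738";
            Stat stat = zk.exists(path, false);
            if (stat == null) {
                System.out.println("Registry node is gone (ephemeral node was deleted).");
            } else {
                System.out.println("Registry node exists, owned by session 0x"
                    + Long.toHexString(stat.getEphemeralOwner()));
            }
            zk.close();
        }
    }

Since the registry node is ephemeral, it disappearing while the consumer is still running points to a session expiration, which lines up with the GC and fsync suspects below.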
Which Kafka version are you using?
Guozhang

On Fri, Jan 10, 2014 at 3:36 PM, Joel Koshy <[EMAIL PROTECTED]> wrote:
First you want to find out the reason for the frequent zookeeper session expirations. A few common causes:
1. Client-side GC (turn on JVM GC logs to validate this; see the sketch after this list)
2. Long I/O pauses on the zookeeper servers (grep for fsync in the zookeeper log4j log)
3. GC on the zookeeper server (again, turn on JVM GC logs to validate this)
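For cause 1, the usual approach is HotSpot GC logging flags along the lines of -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:gc.log. If you want a quick in-process check before wiring that up, here is a minimal sketch that samples the JVM's own GC counters; it has to run inside the consumer JVM to see its GC activity, and the 10-second interval is arbitrary:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Prints cumulative GC counts and times for the current JVM. If GC time
    // between samples approaches the ZK session timeout, client-side GC is a
    // plausible cause of the session expirations.
    public class GcPauseProbe {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc
                        : ManagementFactory.getGarbageCollectorMXBeans()) {
                    System.out.printf("%s: collections=%d, totalTimeMs=%d%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
                }
                Thread.sleep(10000); // sample every 10s
            }
        }
    }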
You would want to fix the root cause of session expirations. I'm not sure yet if you are hitting KAFKA-992 or not.
Thanks,
Neha

On Sun, Jan 12, 2014 at 7:21 PM, Guozhang Wang <[EMAIL PROTECTED]> wrote: