I am trying to understand and document how producers & consumers will/should behave in case of node failures in 0.8. I know there are various other threads that discuss this, but I wanted to bring all the information together in one post. This should also help people building producers & consumers in other languages. Here is my understanding of how Kafka behaves during failures:
Case 1: A node fails that wasn't a leader for any partition. No impact on producers or consumers.
Case 2: A leader node fails, but another in-sync node can become the leader. Publishing to and consumption from the partition whose leader failed will momentarily stop until a new leader is elected. Producers should implement retry logic in such cases (and, in fact, for all kinds of errors from Kafka). Consumers can, depending on the use case, either move on to other partitions after retrying a reasonable number of times (if they fetch from partitions in round-robin fashion) or keep retrying until a leader is available.
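The retry behaviour described above can be sketched roughly like this. This is a minimal, hypothetical helper, not part of any Kafka client library; `send` stands in for whatever produce call your client exposes, and the backoff parameters are illustrative:

```python
import time

def publish_with_retry(send, message, max_retries=5, backoff_s=0.1):
    """Retry a publish while a new leader is being elected (hypothetical helper).

    send       -- callable that attempts the publish and raises on error
    message    -- the payload to publish
    max_retries -- give up after this many attempts
    backoff_s  -- base sleep, doubled on each failed attempt
    """
    last_err = None
    for attempt in range(max_retries):
        try:
            return send(message)
        except Exception as err:
            # Leader election is usually quick, so back off briefly and retry.
            last_err = err
            time.sleep(backoff_s * (2 ** attempt))
    raise RuntimeError("publish failed after %d retries: %r" % (max_retries, last_err))
```

The exponential backoff is just one reasonable choice; the point is that a transient "leader not available" error should be retried rather than surfaced immediately.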
Case 3: A leader node goes down and no other in-sync nodes are available. In this case, publishing to and consumption from the partition will halt and will not resume until the faulty leader node recovers. Producers should fail the publish request after retrying a reasonable number of times and provide a callback so the client of the producer can take corrective action. Consumers again have a choice: move on to other partitions after retrying a reasonable number of times (if fetching from partitions in round-robin fashion) or keep retrying until a leader is available. In the latter case, the entire consumer process will halt until the faulty node recovers.
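The "fail after retries and hand control back to the client" pattern for Case 3 might look like the following sketch. Names here (`publish_or_fail`, `on_failure`) are hypothetical, not from any Kafka API:

```python
def publish_or_fail(send, message, on_failure, max_retries=3):
    """Sketch: exhaust retries, then invoke the client's corrective callback.

    send       -- callable that attempts the publish and raises on error
    on_failure -- client-supplied callback(message, error) for corrective action
    """
    last_err = None
    for _ in range(max_retries):
        try:
            return send(message)
        except Exception as err:
            last_err = err
    # No leader came back; let the client decide (dead-letter, alert, abort, ...).
    on_failure(message, last_err)
    return None
```

The callback keeps the decision (drop, buffer to disk, page someone) with the application rather than the producer library.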
> and will not resume until the faulty leader node recovers
Can you confirm that's the case? I think they won't wait until the leader has recovered and will instead try to elect a new leader from the existing non-ISR replicas. And if they do wait, what happens if the faulty leader never comes back? On Thu, Oct 24, 2013 at 6:24 AM, Aniket Bhatnagar < [EMAIL PROTECTED]> wrote:
Yes, when a leader dies, the preference is to pick the new leader from the ISR. If that's not possible, the leader is picked from any other available replica. But if no replicas are alive, the partition goes offline, and all production and consumption halt until at least one replica is brought online.
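The preference order Neha describes can be summarized in a few lines. This is an illustrative sketch of the selection logic, not Kafka's actual controller code:

```python
def pick_leader(isr, replicas, alive):
    """Sketch of leader selection preference: ISR first, then any live replica.

    isr      -- ordered list of broker ids currently in the in-sync replica set
    replicas -- ordered list of all assigned replica broker ids
    alive    -- set of broker ids that are currently up
    Returns the chosen broker id, or None if the partition goes offline.
    """
    for broker in isr:
        if broker in alive:
            return broker          # preferred: an in-sync replica
    for broker in replicas:
        if broker in alive:
            return broker          # fallback: a non-ISR replica (may lose data)
    return None                    # no replicas alive: partition is offline
```

The non-ISR fallback is what Kane is asking about; electing from a replica that was behind can lose messages, which is why the ISR is preferred.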
Thanks, Neha On Thu, Oct 24, 2013 at 11:57 AM, Kane Kane <[EMAIL PROTECTED]> wrote:
I am planning to use the kafka 0.8 spout, and after studying the source code I found that it doesn't handle errors. There is a fork that adds a try/catch around the use of fetchResponse, but my guess is this will lead to the spout attempting the same partition infinitely until the leader is elected/comes back online. I will try to submit a pull request to the kafka 0.8 plus spout to fix this issue in a couple of days. On 25 Oct 2013 02:58, "Chris Bedford" <[EMAIL PROTECTED]> wrote:
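The fix Kane is hinting at, catching the fetch error for one partition but continuing to the others instead of spinning on the dead one, could be sketched like this. `fetch` is a hypothetical per-partition fetch function, not the spout's real API:

```python
def fetch_round_robin(partitions, fetch):
    """Sketch: on a fetch error, skip that partition and continue the cycle.

    partitions -- iterable of partition ids to poll in order
    fetch      -- callable(partition) returning a list of messages, raising on error
    """
    messages = []
    for p in partitions:
        try:
            messages.extend(fetch(p))
        except Exception:
            # Leader unavailable for this partition; move on and
            # try it again on the next round-robin pass.
            continue
    return messages
```

This keeps the consumer making progress on healthy partitions while a leader election (or node recovery) is in flight for the unhealthy one.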