Re: Understanding how producers and consumers behave in case of node failures in 0.8
Aniket Bhatnagar 2013-10-24, 13:25
On 24 October 2013 18:11, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> Yes. And during retries, the producer and consumer refetch metadata.
> On Oct 24, 2013 3:09 AM, "Aniket Bhatnagar" <[EMAIL PROTECTED]> wrote:
> > I am trying to understand and document how producers & consumers
> > will/should behave in case of node failures in 0.8. I know there are
> > various other threads that discuss this but I wanted to bring all the
> > information together in one post. This should help people building
> > producers & consumers in other languages as well. Here is my understanding
> > of how Kafka behaves in failures:
> > Case 1: If a node fails that wasn't a leader for any partitions
> > No impact on consumers and producers
> > Case 2: If a leader node fails but another in-sync node can become the
> > leader
> > All publishing to and consumption from the partition whose leader failed
> > will momentarily stop until a new leader is elected. Producers should
> > implement retry logic in such cases (and in fact in all kinds of errors
> > from Kafka) and consumers can (depending on your use case) either continue
> > to other partitions after retrying a decent number of times (in case you are
> > fetching from partitions in round robin fashion) or keep retrying until a
> > leader is available.
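
The producer-side retry described in Case 2 (and confirmed in the reply above: metadata is refetched during retries) could be sketched roughly as follows. This is a minimal illustration, not Kafka's actual client code. The names `LeaderNotAvailable`, `send_with_retry`, `send`, `refresh_metadata` and `on_failure` are all hypothetical placeholders for whatever a given client library provides, and the retry count and backoff values are arbitrary.

```python
import time


class LeaderNotAvailable(Exception):
    """Hypothetical error: the partition currently has no reachable leader."""


def send_with_retry(send, refresh_metadata, message, max_retries=3,
                    backoff_s=0.1, on_failure=None, sleep=time.sleep):
    """Try to publish `message`, refetching metadata between attempts.

    `send` and `refresh_metadata` are caller-supplied callables (assumed,
    not part of any real Kafka API). Returns True if the message was sent,
    False if every attempt failed.
    """
    for attempt in range(max_retries + 1):
        try:
            send(message)
            return True
        except LeaderNotAvailable as err:
            if attempt == max_retries:
                # Out of retries: hand the failure back to the caller so it
                # can take corrective action (Case 3 in the post).
                if on_failure is not None:
                    on_failure(message, err)
                return False
            # Give leader election a moment, then learn who the new leader is.
            sleep(backoff_s)
            refresh_metadata()
```

If a new leader is elected within the retry window (Case 2) the call returns True. If no leader ever appears (Case 3) the `on_failure` callback fires once and the call returns False.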
> > Case 3: If a leader node goes down and no other in-sync nodes are available
> > In this case, publishing to and consumption from the partition will halt
> > and will not resume until the faulty leader node recovers. In this case,
> > producers should fail the publish request after retrying a decent number of
> > times and provide a callback to the client of the producer to take
> > corrective action. Consumers again have a choice to continue to other
> > partitions after retrying a decent number of times (in case you are fetching
> > from partitions in round robin fashion) or keep retrying until a leader is
> > available. In the latter case, the entire consumer process will halt until
> > the faulty node recovers.
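
The first consumer option (move on to other partitions after a bounded number of retries) might look something like the sketch below. It is illustrative only. `LeaderNotAvailable`, `poll_round_robin`, `fetch` and `refresh_metadata` are hypothetical names standing in for a real client's calls, and a real consumer would probably also back off between attempts.

```python
class LeaderNotAvailable(Exception):
    """Hypothetical error: the partition currently has no reachable leader."""


def poll_round_robin(partitions, fetch, refresh_metadata, max_retries=3):
    """Make one round-robin pass over `partitions`.

    `fetch(partition)` and `refresh_metadata()` are caller-supplied
    callables (assumed, not a real Kafka API). Returns a tuple of
    (messages keyed by partition, list of partitions skipped this pass).
    """
    results = {}
    skipped = []
    for partition in partitions:
        for _ in range(max_retries + 1):
            try:
                results[partition] = fetch(partition)
                break
            except LeaderNotAvailable:
                # The leader may have moved; refetch metadata before retrying.
                refresh_metadata()
        else:
            # Every attempt failed: skip this partition for now instead of
            # blocking the whole consumer, and retry it on a later pass.
            skipped.append(partition)
    return results, skipped
```

The alternative in the post, retrying until the leader returns, corresponds to an unbounded inner loop, which is why the whole consumer stalls in Case 3.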
> > Do I have this right?