Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Kafka, mail # user - Understanding how producers and consumers behave in case of node failures in 0.8


Copy link to this message
-
Re: Understanding how producers and consumers behave in case of node failures in 0.8
Aniket Bhatnagar 2013-10-25, 01:40
I am planning to use kafka 0.8 spout and after studing the source code
found that it doesnt handle errors. There is a fork that adds try catch
over using fetchResponse but my guess is this will lead to spout attempting
the same partition infinitely until the leader is elected/comes back
online. I will try and submit a pull request to kafka 0.8 plus spout to fix
this issue in a couple of days.
 On 25 Oct 2013 02:58, "Chris Bedford" <[EMAIL PROTECTED]> wrote:

> Thank you for posting these guidelines.   I'm wondering if anyone out there
> that is using the Kafka spout (for Storm) knows whether or not the Kafka
> spout takes care of these types of details ?
>
>  regards
>  -chris
>
>
>
> On Thu, Oct 24, 2013 at 2:05 PM, Neha Narkhede <[EMAIL PROTECTED]
> >wrote:
>
> > Yes, when a leader dies, the preference is to pick a leader from the ISR.
> > If not, the leader is picked from any other available replica. But if no
> > replicas are alive, the partition goes offline and all production and
> > consumption halts, until at least one replica is brought online.
> >
> > Thanks,
> > Neha
> >
> >
> > On Thu, Oct 24, 2013 at 11:57 AM, Kane Kane <[EMAIL PROTECTED]>
> wrote:
> >
> > > >>publishing to and consumption from the partition will halt
> > > and will not resume until the faulty leader node recovers
> > >
> > > Can you confirm that's the case? I think they won't wait until leader
> > > recovered and will try to elect new leader from existing non-ISR
> > replicas?
> > > And in case if they wait, and faulty leader never comes back?
> > >
> > >
> > > On Thu, Oct 24, 2013 at 6:24 AM, Aniket Bhatnagar <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > > > Thanks Neha
> > > >
> > > >
> > > > On 24 October 2013 18:11, Neha Narkhede <[EMAIL PROTECTED]>
> > wrote:
> > > >
> > > > > Yes. And during retries, the producer and consumer refetch
> metadata.
> > > > >
> > > > > Thanks,
> > > > > Neha
> > > > > On Oct 24, 2013 3:09 AM, "Aniket Bhatnagar" <
> > > [EMAIL PROTECTED]>
> > > > > wrote:
> > > > >
> > > > > > I am trying to understand and document how producers & consumers
> > > > > > will/should behave in case of node failures in 0.8. I know there
> > are
> > > > > > various other threads that discuss this but I wanted to bring all
> > the
> > > > > > information together in one post. This should help people
> building
> > > > > > producers & consumers in other languages as well. Here is my
> > > > > understanding
> > > > > > of how Kafak behaves in failures:
> > > > > >
> > > > > > Case 1: If a node fails that wasn't a leader for any partitions
> > > > > > No impact on consumers and producers
> > > > > >
> > > > > > Case 2: If a leader node fails but another in sync node can be
> > > become a
> > > > > > leader
> > > > > > All publishing to and consumption from the partition whose leader
> > > > failed
> > > > > > will momentarily stop until a new leader is elected. Producers
> > should
> > > > > > implement retry logic in such cases (and in fact in all kinds of
> > > errors
> > > > > > from Kafka) and consumers can (depending on your use case) either
> > > > > continue
> > > > > > to other partitions after retrying decent number of times (in
> case
> > > you
> > > > > are
> > > > > > fetching from partitions in round robin fashion) or keep retrying
> > > until
> > > > > > leader is available.
> > > > > >
> > > > > > Case 3: If a leader node goes down and no other in sync nodes are
> > > > > available
> > > > > > In this case, publishing to and consumption from the partition
> will
> > > > halt
> > > > > > and will not resume until the faulty leader node recovers. In
> this
> > > > case,
> > > > > > producers should fail the publish request after retrying decent
> > > number
> > > > of
> > > > > > times and provide a callback to the client of the producer to
> take
> > > > > > corrective action. Consumers again have a choice to continue to
> > other
> > > > > > partitions after retrying decent number of times (in case you are