Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Kafka >> mail # user >> producer exceptions when broker dies


Copy link to this message
-
Re: producer exceptions when broker dies
Perhaps, it's a matter of semantics.

But, I think I'm not talking only about failure, but normal operation.
 It's normal to take a cluster down for maintenance, or code update.  And
this should be done a rolling restart manner (1 server at a time).

The reason for replication, is to increase reliability (as well as
availability).  We've said that in 0.8.0, we can tolerate R-1 node failures
(where R is the replication factor).  From the client's standpoint, this
should mean that the cluster is still available and reliable when a single
node is down (assuming R is > 1).

When you say "at least once", you are suggesting that the message will be
delivered at least once, and won't be lost.

I don't think that was ever really true about the previous version <= 0.7
(and I don't think I ever really read that about 0.7, as a guarantee, did
I?).

Jason
On Sat, Oct 26, 2013 at 8:52 PM, Guozhang Wang <[EMAIL PROTECTED]> wrote:

> Hello Jason,
>
> You are right. I think we just have different definitions about "at least
> once". What you have described to me is more related to "availability",
> which says that your message will not be lost when there are failures. And
> we achieve this through replication (which is related to
> request.required.acks). In 0.7, we do not have any replications and hence
> when a broker goes down all the partitions it holds will be non-available,
> but we still defined Kafka as a "at least once" messaging system.
>
> Guozhang
>
>
> On Sat, Oct 26, 2013 at 9:32 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
>
> > Guozhang,
> >
> > It turns out this is not entirely true, you do need
> request.required.acks =
> > -1 (and yes, you need to retry if failure) in order have guaranteed
> > delivery.
> >
> > I discovered this, when doing tests with rolling restarts (and hard
> > restarts) of the kafka servers.  If the server goes down, e.g. if there's
> > any change in the ISR leader for a partition, then the only way you can
> be
> > sure what you produced will be available on a newly elected leader is to
> > use -1.
> >
> > Jason
> >
> >
> >
> >
> > On Sat, Oct 26, 2013 at 12:11 PM, Guozhang Wang <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Jason,
> > >
> > > Setting request.required.acks=-1 is orthogonal to the 'at least once'
> > > guarantee, it only relates to the latency/replication trade-off. For
> > > example, even if you set request.required.acks to 1, and as long as you
> > > retry on all non-fatal exceptions you have the "at least once"
> guarantee;
> > > and even if you set request.required.acks to -1 and you do not retry,
> you
> > > will not get "at least once".
> > >
> > > As you said, setting request.required.acks=-1 only means that when a
> > > success response has been returned you know that the message has been
> > > received by all ISR replicas, but this has nothing to do with "at least
> > > once" guarantee.
> > >
> > > Guozhang
> > >
> > >
> > > On Fri, Oct 25, 2013 at 8:55 PM, Jason Rosenberg <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > Just to clarify, I think in order to get 'at least once' guarantees,
> > you
> > > > must produce messages with 'request.required.acks=-1'.  Otherwise,
> you
> > > > can't be 100% sure the message was received by all ISR replicas.
> > > >
> > > >
> > > > On Fri, Oct 25, 2013 at 9:56 PM, Kane Kane <[EMAIL PROTECTED]>
> > > wrote:
> > > >
> > > > > Thanks Guozhang, it makes sense if it's by design. Just wanted to
> > > ensure
> > > > > i'm not doing something wrong.
> > > > >
> > > > >
> > > > > On Fri, Oct 25, 2013 at 5:57 PM, Guozhang Wang <[EMAIL PROTECTED]
> >
> > > > wrote:
> > > > >
> > > > > > As we have said, the timeout exception does not actually mean the
> > > > message
> > > > > > is not committed to the broker. When message.send.max.retries is
> 0
> > > > Kafka
> > > > > > does guarantee "at-most-once" which means that you will not have
> > > > > > duplicates, but not means that all your exceptions can be treated
> > as
> > > > > > "message not delivered". In your case, 1480 - 1450 = 30 messages