Kafka >> mail # user >> Controlled shutdown failure, retry settings


Jason Rosenberg 2013-10-25, 08:18
Joel Koshy 2013-10-25, 14:35
Neha Narkhede 2013-10-25, 14:41
Jason Rosenberg 2013-10-25, 15:26
Neha Narkhede 2013-10-25, 16:39
Jason Rosenberg 2013-10-25, 16:51
Jason Rosenberg 2013-10-25, 22:22
Joel Koshy 2013-10-26, 01:17
Jason Rosenberg 2013-10-26, 03:51
Jason Rosenberg 2013-10-29, 20:29
Jason Rosenberg 2013-10-29, 20:39
Joel Koshy 2013-11-01, 20:43
Neha Narkhede 2013-11-01, 21:00
Jason Rosenberg 2013-11-02, 05:36
Jun Rao 2013-11-03, 03:41
Jason Rosenberg 2013-11-03, 11:24
Re: Controlled shutdown failure, retry settings
A replica is dropped out of ISR if (1) it hasn't issued a fetch request for
some time (replica.lag.time.max.ms), or (2) it's behind the leader by too
many messages (replica.lag.max.messages). The replica will be added back to
ISR once neither condition holds any longer.
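
As a rough illustration of the two conditions, here is a minimal sketch in Python. It is not Kafka's implementation: FollowerState and in_sync are hypothetical names, and the thresholds are simply the defaults quoted elsewhere in this thread (10000 ms and 4000 messages).

```python
from dataclasses import dataclass

# Assumed thresholds: the defaults mentioned in this thread.
REPLICA_LAG_TIME_MAX_MS = 10_000   # replica.lag.time.max.ms
REPLICA_LAG_MAX_MESSAGES = 4_000   # replica.lag.max.messages


@dataclass
class FollowerState:
    """Hypothetical view of a follower as seen by the leader."""
    last_fetch_ms: int    # time of the follower's most recent fetch request
    log_end_offset: int   # follower's log end offset


def in_sync(follower, leader_log_end_offset, now_ms):
    """Return True if neither removal condition currently holds."""
    # Condition (1): no fetch request for too long.
    stuck = now_ms - follower.last_fetch_ms > REPLICA_LAG_TIME_MAX_MS
    # Condition (2): too many messages behind the leader.
    lag = leader_log_end_offset - follower.log_end_offset
    slow = lag > REPLICA_LAG_MAX_MESSAGES
    return not (stuck or slow)
```

A follower for which in_sync returns False would be dropped from the ISR in this model, and would rejoin once the function returns True again.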

The actual value depends on the application. For example, if there is a
spike and the follower can't keep up, the application has to decide whether
to slow down the commit of messages or let the replicas drift apart.
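
To put rough numbers on a spike, the sketch below estimates how long a follower can fall behind before its lag reaches the message threshold. The function name and the rates are assumptions for illustration only (20,000 messages per second is just a figure in the "tens of thousands" range mentioned in the question); 4000 is the default quoted in this thread.

```python
def seconds_until_out_of_isr(produce_rate, follower_rate=0.0, max_lag_messages=4_000):
    """Seconds until the follower's lag reaches max_lag_messages.

    produce_rate and follower_rate are in messages per second.
    """
    gap_growth = produce_rate - follower_rate  # lag gained per second
    if gap_growth <= 0:
        return float("inf")  # the follower keeps up, so the lag never grows
    return max_lag_messages / gap_growth


# A follower that stops fetching entirely while the leader takes 20,000 msg/s.
stalled = seconds_until_out_of_isr(20_000)
# A follower that still replicates 18,000 msg/s during the same spike.
behind = seconds_until_out_of_isr(20_000, follower_rate=18_000)
print(stalled, behind)
```

Under these assumed rates the message threshold is reached in a fraction of a second to a couple of seconds, well before the 10 second time threshold, which is why the right value depends on the traffic the application expects.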

Thanks,

Jun
On Sun, Nov 3, 2013 at 3:23 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:

> Jun,
>
> Can you explain the difference between "failed" and "slow"?  In either
> case, the follower drops out of the ISR, and can come back later if it
> catches up, no?
>
> In the configuration doc, it seems to describe them both with the same
> language:  "if ....., the leader will remove the follower from ISR and
> treat it as dead."
>
> The *.max.messages setting seems to make the system somewhat susceptible to
> sudden spikes of message traffic.
>
> At first glance, the defaults seem a bit out of balance.  The default
> *.max.ms is 10 seconds, while the default *.max.messages is only 4000
> messages.  Given that we can handle tens of thousands of messages a
> second, what is the thinking behind these defaults?
>
> Jason
>
>
> On Sat, Nov 2, 2013 at 11:41 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
>
> > replica.lag.time.max.ms is used to detect a failed broker.
> > replica.lag.max.messages is used to detect a slow broker.
> >
> > Thanks,
> >
> > Jun
> >
> >
> > On Fri, Nov 1, 2013 at 10:36 PM, Jason Rosenberg <[EMAIL PROTECTED]> wrote:
> >
> > > In response to Joel's point, I think I do understand that messages can be
> > > lost, if in fact we have dropped down to only 1 member in the ISR at the
> > > time the message is written, and then that 1 node goes down.
> > >
> > > What I'm not clear on is the conditions under which a node can drop out of
> > > the ISR.  You said:
> > >
> > > "- ISR = 0, leader = 0; (so 1 is out of the ISR - say if broker 0 is
> > > slow (but up))"
> > >
> > > Did you mean to say "if broker *1* is slow (but up)"?
> > >
> > > I assume by "slow", you mean when a follower hasn't made a fetch request
> > > within "replica.lag.time.max.ms"?  The default for this is 10000 ms, so it
> > > would have to be in pretty bad shape to be "up" but "slow", no?
> > >
> > > It also does seem odd that a node can be too slow to remain in an ISR, but
> > > then be made available to compete in an ISR leader election.....
> > >
> > > Jason
> > >
> > >
> > > On Fri, Nov 1, 2013 at 5:00 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> > >
> > > > For supporting more durability at the expense of availability, we have a
> > > > JIRA that we will fix on trunk. This will allow you to configure the
> > > > default as well as per topic durability vs availability behavior -
> > > > https://issues.apache.org/jira/browse/KAFKA-1028
> > > >
> > > > Thanks,
> > > > Neha
> > > >
> > > >
> > > > On Fri, Nov 1, 2013 at 1:43 PM, Joel Koshy <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > >>>>
> > > > > >>>> Unclean shutdown could result in data loss - since you are moving
> > > > > >>>> leadership to a replica that has fallen out of ISR. i.e., its log end
> > > > > >>>> offset is behind the last committed message to this partition.
> > > > > >>>>
> > > > > >>>>
> > > > > >>> But if data is written with 'request.required.acks=-1', no data should be
> > > > > >>> lost, no?  Or will partitions be truncated wholesale after an unclean
> > > > > >>> shutdown?
> > > > >
> > > > > Sorry about the delayed reply to this, but it is an important point -
> > > > > data can be lost even in this case. -1 means ack after all replicas in
> > > > > the current ISR have received the message. So for example:
> > > > > - assigned replicas for some partition = 0,1
> > > > > - ISR = 0, leader = 0; (so 1 is out of the ISR - say if broker 0 is
> > > > > slow (but up))
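
The quoted example is cut off here. As a toy illustration of the scenario summarized earlier in the thread (the ISR shrinks to one member, a message is acked, and then that one node goes down), the sketch below models a partition in plain Python. It is not Kafka code, and ToyPartition, produce_acks_all, unclean_leader_change and simulate are hypothetical names.

```python
class ToyPartition:
    """Toy model of one partition; not Kafka code."""

    def __init__(self, replicas, leader):
        self.logs = {broker: [] for broker in replicas}
        self.isr = set(replicas)
        self.leader = leader

    def produce_acks_all(self, message):
        # acks=-1: acknowledge once every replica in the *current* ISR has it.
        for broker in self.isr:
            self.logs[broker].append(message)
        return True

    def unclean_leader_change(self, new_leader):
        # Leadership moves to a replica that was outside the ISR.
        self.leader = new_leader
        self.isr = {new_leader}


def simulate():
    p = ToyPartition(replicas=[0, 1], leader=0)
    p.produce_acks_all("m1")           # ISR = {0, 1}: both replicas store m1
    p.isr = {0}                        # broker 1 drops out of the ISR
    acked = p.produce_acks_all("m2")   # acked, but only broker 0 stores m2
    p.unclean_leader_change(1)         # broker 0 goes down; broker 1 takes over
    return acked, p.logs[p.leader]


print(simulate())
```

In this model m2 was acknowledged to the producer, yet the new leader's log does not contain it, which is the kind of loss being described even with 'request.required.acks=-1'.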

 