Ok, so if I initiate a controlled shutdown, in which all partitions that a
shutting down broker is leader of get transferred to another broker, why
can't part of that controlled transfer of leadership include ISR
synchronization, such that no data is lost? Is there a fundamental reason
why that is not possible? Is it it worth filing a Jira for a feature
request? Or is it technically not possible?
It is not as straightforward as it seems and it will slow down the shut
down operation furthermore (currently several zookeeper writes already slow
it down) and also increase the leader unavailability window. But keeping
the performance degradation aside, it is tricky since in order to stop
"new" data from coming in, we need to move the leaders off of the current
leader (broker being shutdown) onto some other follower. Now moving the
leader means some other follower will become the leader and as part of that
will stop copying existing data from the old leader and will start
receiving new data. What you are asking for is to insert some kind of
"wait" before the new follower becomes the leader so that the consumption
of messages is "done". What is the definition of "done" ? This is only
dictated by the log end offset of the old leader and will have to be
included in the new leader transition state change request by the
controller. So this means an extra RPC between the controller and the other
brokers as part of the leader transition. Also there is no guarantee that
the other followers are alive and consuming, so how long does the broker
being shutdown wait ? Since it is no longer the leader, it technically
cannot kick followers out of the ISR, so ISR handling is another thing that
becomes tricky here.
Also, this sort of catching up on the last few messages is not limited to
just controlled shutdown, but you can extend it to any other leader
transition. So special casing it does not make much sense. This "wait"
combined with the extra RPC will mean extending the leader unavailability
window even further than what we have today.
So this is fair amount of work to make sure last few messages are
replicated to all followers. Instead what we do is to simplify the leader
transition and let the clients handle the retries for requests that have
not made it to the desired number of replicas, which is configurable.
We can discuss that in a JIRA if that helps. May be other committers have
On Sun, Oct 6, 2013 at 10:08 AM, Jason Rosenberg <[EMAIL PROTECTED]> wrote: