Re: Segment recovery and replication
Neha Narkhede 2013-08-29, 21:12
>> How do you automate waiting for the broker to come up? Just keep
monitoring the process and keep trying to connect to the port?
Every leader in a Kafka cluster exposes the UnderReplicatedPartitionCount
metric. The safest way to issue a controlled shutdown is to wait until that
metric reports 0 on the brokers. If you try to shut down the last broker in
the ISR, the controlled shutdown cannot succeed since there is no other
broker to move the leader to. Waiting until the under-replicated partition
count hits 0 prevents you from hitting this issue.
This also solves the problem of waiting until the broker comes up, since you
will automatically wait until the broker comes up and joins the ISR.
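
A rough sketch of what that wait could look like in a rolling-restart
script. This is only an illustration, not something shipped with Kafka:
read_under_replicated is a hypothetical callable that you would supply
yourself (for example, something that reads the metric over JMX), and the
timeout and poll interval are made-up defaults.

```python
import time
from typing import Callable, Iterable


def wait_until_fully_replicated(
    brokers: Iterable[str],
    read_under_replicated: Callable[[str], int],  # hypothetical, caller-supplied
    timeout_s: float = 900.0,       # assumed default, tune for your log sizes
    poll_interval_s: float = 5.0,   # assumed default
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until every broker reports 0 under-replicated partitions.

    Returns True once all brokers report 0, False if the timeout expires.
    A broker that cannot be reached yet (OSError, e.g. connection refused)
    is treated as "not ready" rather than as a fatal error.
    """
    brokers = list(brokers)
    deadline = clock() + timeout_s
    while True:
        try:
            counts = [read_under_replicated(b) for b in brokers]
        except OSError:
            counts = None  # some broker is still starting up
        if counts is not None and all(c == 0 for c in counts):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval_s)
```

In a rolling restart you would call this after restarting each broker and
only move on to the next one when it returns True. That covers both cases:
the restarted broker has to be up and back in the ISR before the count can
drop to 0.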
On Thu, Aug 29, 2013 at 12:59 PM, Sam Meder <[EMAIL PROTECTED]> wrote:
> Ok, I spent some more time staring at our logs and figured out that it was
> our fault. We were not waiting around for the Kafka broker to fully
> initialize before moving on to the next broker, and loading the data logs
> can take quite some time (~7 minutes in one case), so we ended up with no
> replicas online at some point and the replica that came back first was a
> little short on data...
> How do you automate waiting for the broker to come up? Just keep
> monitoring the process and keep trying to connect to the port?
> On Aug 29, 2013, at 6:40 PM, Sam Meder <[EMAIL PROTECTED]> wrote:
> > On Aug 29, 2013, at 5:50 PM, Sriram Subramanian <
> [EMAIL PROTECTED]> wrote:
> >> Do you know why you timed out on a regular shutdown?
> > No, though I think it may just have been that the timeout we put in was
> too short.
> >> If the replica had
> >> fallen off of the ISR and shutdown was forced on the leader this could
> >> happen.
> > Hmm, but it shouldn't really be made leader if it isn't even in the isr,
> should it?
> > /Sam
> >> With ack = -1, we guarantee that all the replicas in the in sync
> >> set have received the message before exposing the message to the
> >> On 8/29/13 8:32 AM, "Sam Meder" <[EMAIL PROTECTED]> wrote:
> >>> We've recently come across a scenario where we see consumers resetting
> >>> their offsets to earliest and which, as far as I can tell, may also lead
> >>> to data loss (we're running with ack = -1 to avoid loss). This seems to
> >>> happen when we time out on doing a regular shutdown and instead kill -9
> >>> the kafka broker, but does obviously apply to any scenario that
> >>> involves an unclean exit. As far as I can tell what happens is
> >>> 1. On restart the broker truncates the data for the affected
> >>> partitions, i.e. not all data was written to disk.
> >>> 2. The new broker then becomes a leader for the affected partitions and
> >>> consumers get confused because they've already consumed beyond the now
> >>> available offset.
> >>> Does that seem like a possible failure scenario?
> >>> /Sam