This has happened a couple of times to me now in the past month, and I’m not entirely sure of the cause, although I have a suspicion.
Early this morning (UTC), it looks like one of my two brokers (id 21) lost its connection to Zookeeper for a very short period of time. This caused the second broker (id 22) to quickly become the leader for all partitions. Once broker 21 was able to re-establish its Zookeeper connection, it noticed that it has a stale list for the ISR, got its updated list, and started replicating from broker 22 for all partitions. Broker 21 then quickly rejoined the ISR, but annoyingly (but expectedly), broker 22 remained the leader. All of this happened in under a minute.
I’m wondering if https://issues.apache.org/jira/browse/KAFKA-766 is related. The current batch size on our producers is 6000 msgs or 1000 ms (I’ve been meaning to reduce this). We do about 6000 msgs per second / per producer, and have 10 partitions in this relevant topic. A couple of days ago, we noticed flapping ISR Shrink/Expand logs, so I upped replica.lag.max.messages to 10000, so that it would surely be above our batch size. I still occasionally see flapping ISR Shrinks/Expands, but hope that when I reduce the producer batch size, I will stop seeing these.
Anyway, I’m not entirely sure what happened here. Could flapping ISRs potentially cause this?
Yeah, I can do that, but I’d prefer if the first broker didn’t drop out of the ISR in the first place. Just trying to figure out why it did… On Feb 21, 2014, at 11:30 PM, Jun Rao <[EMAIL PROTECTED]> wrote: