I've been seeing this issue ever since I first started this thread,
but it was happening infrequently enough for me to brush it aside and just
run an election to rebalance brokers.
I recently expanded (and reinstalled) our Kafka cluster so that it now has
4 brokers with a default replication factor of 3 for each partition. I
also switched over to G1GC as recommended here: https://kafka.apache.org/081/ops.html
(even though we are still running Kafka 0.8.0; we hope to upgrade soon).
Now one of the 4 brokers (analytics1021, the same problem broker we
saw before), and only that one, has its ZK connection expire even more
frequently. Previously it was less than once a week; now I am seeing this
happen multiple times a day.
I've posted all the relevant logs from a recent event here: https://gist.github.com/ottomata/e42480446c627ea0af22
This includes the GC log on the offending Kafka broker during the time this
happened. I am pretty green when it comes to GC tuning, but I do see this:
[Times: user=0.14 sys=0.00, real=11.47 secs]
Did Kafka's JVM really just take 11.47 secs to do a GC there? I'm
probably missing something, but I don't see which part of that real
time summary makes up the bulk of that GC time.
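
To check whether this was a one-off, here is a minimal sketch for scanning a
GC log for similar entries. It assumes the log uses the standard HotSpot
`[Times: user=... sys=..., real=... secs]` summary, where user and sys are
CPU time and real is wall-clock time. The function name and both thresholds
are hypothetical, picked only for illustration. A pause where real is far
larger than user + sys may mean the JVM was waiting on something (swap, disk,
CPU contention) rather than doing collection work, though that is a guess,
not a diagnosis.

```python
import re

# Matches the timing summary HotSpot appends to GC log entries, e.g.
#   [Times: user=0.14 sys=0.00, real=11.47 secs]
TIMES_RE = re.compile(
    r"\[Times: user=(?P<user>\d+\.\d+) sys=(?P<sys>\d+\.\d+), "
    r"real=(?P<real>\d+\.\d+) secs\]"
)


def find_stalled_pauses(lines, min_real_secs=1.0, max_cpu_ratio=0.5):
    """Return (line_number, user, sys, real) for suspicious pauses.

    A pause is reported when its wall-clock time is at least
    min_real_secs and its CPU time (user + sys) is at most
    max_cpu_ratio * real. Both thresholds are illustrative assumptions.
    """
    stalled = []
    for lineno, line in enumerate(lines, start=1):
        match = TIMES_RE.search(line)
        if match is None:
            continue  # not a timing summary line
        user = float(match.group("user"))
        sys_time = float(match.group("sys"))
        real = float(match.group("real"))
        if real >= min_real_secs and (user + sys_time) <= max_cpu_ratio * real:
            stalled.append((lineno, user, sys_time, real))
    return stalled


if __name__ == "__main__":
    # The first entry is the line from the GC log quoted above; the
    # second is a made-up short pause included only for contrast.
    sample = [
        "[Times: user=0.14 sys=0.00, real=11.47 secs]",
        "[Times: user=0.30 sys=0.01, real=0.05 secs]",
    ]
    for lineno, user, sys_time, real in find_stalled_pauses(sample):
        print(f"line {lineno}: real={real}s but cpu={user + sys_time:.2f}s")
```

To run it against a real log, pass an open file object instead of the sample
list, e.g. `find_stalled_pauses(open("kafka-gc.log"))`, where the file name
is a placeholder for wherever the broker writes its GC log.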
This is strange, right? This broker is identically configured to all
its peers, and should be handling on average the exact same amount and
type of traffic. Anyone have any advice?
On Fri, Mar 21, 2014 at 6:48 PM, Neha Narkhede <[EMAIL PROTECTED]>