Thanks Jun, your suggestion helped me quite a bit.
As of earlier this week I've been able to work out the issues (at least it
seems that way for now). My consumer is now processing messages at roughly
the rate they are being produced, with an acceptable amount of end-to-end
lag. Here is an overview of the issues I had; let me know if the way I
resolved things makes sense:
- Many serialization errors in the producers. Fixing these eliminated what
were previously perceived as lost or delayed messages.
- One of the producers was not accessible through the VIP we were sending
messages to. There was also a bug in the health check that caused the
NetScaler to drop one of the producers. Both of these contributed to
sending too many messages to one producer, which filled up its blocking
queue.
- I had to increase queue.size on the producers several times (currently
at 320k). This may now be unnecessarily high given my next point.
- Increased batch.size on the producers several times. The last increase
(batch.size=1600) is what finally got things going at the rate I am happy
with.
- Decreased num.partitions and log.flush.interval on the brokers from
64/10k to 32/100 in order to lower the average flush time (we were
previously always hitting the time-based default flush interval, since no
partition ever accumulated 10k messages). Flush times are currently
< 100 ms (not sure if this is too low, but everything seems to be
working); the average flush time was previously 1 second.
- Increased fetch.size and queuedchunks.max on the consumers several times
and ended at 80MB/100k. This was before I made a bunch of the changes on
the producer side, so these may be unnecessarily high as well. (The final
values are collected in the sketch after this list.)
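To put all of that in one place, the relevant settings now look roughly
like the following (paraphrasing; producer.type=async is shown for
context, since queue.size and batch.size only apply to the async
producer):

  # producers
  producer.type=async      # queue.size bounds the async producer's blocking queue
  queue.size=320000        # possibly higher than it needs to be now
  batch.size=1600

  # brokers
  num.partitions=32
  log.flush.interval=100                # messages per partition before a flush
  log.default.flush.interval.ms=1000    # time-based flush fallback

  # consumers
  fetch.size=83886080      # 80 MB
  queuedchunks.max=100000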
Once again, thanks for all of the help. I'm curious to know which, if any,
of the changes I made were unnecessary.
On Tue, Apr 23, 2013 at 7:53 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
> You can run kafka.tools.ConsumerOffsetChecker to check the consumer lag. If
> the consumer is lagging, this indicates a problem on the consumer side.
> On Mon, Apr 22, 2013 at 9:13 PM, Andrew Neilson <[EMAIL PROTECTED]> wrote:
> > Hmm it is highly unlikely that that is the culprit... There is lots of
> > bandwidth available for me to use. I will definitely keep that in mind
> > though. I was working on this today and have some tidbits of additional
> > information and thoughts that you might be able to shed some light on:
> > - I mentioned I have 2 consumers, but each consumer is running with 8
> > threads for this topic (and each consumer has 8 cores available).
> > - When I initially asked for help, the brokers were configured with
> > num.partitions=1; I've since tried higher numbers (3, 64) and haven't
> > seen much of an improvement, aside from forcing both consumer apps to
> > handle messages (with the overall performance not changing much).
> > - I ran into an article and tried a variety of options for
> > queuedchunks.max and fetch.size with no significant results (meaning
> > they did not achieve the goal of consistently processing hundreds or
> > thousands of messages per second, which is roughly the rate of input).
> > I would not be surprised if I'm missing something, but this made me
> > start to think that the problem may lie outside of the consumers.
> > - Would the combination of a high number of partitions (64) and a high
> > log.flush.interval (10k messages) prevent logs from flushing as often
> > as they need to for my desired rate of consumption, even with
> > log.default.flush.interval.ms=1000?
> > Despite the changes I mentioned, the behaviour is still the same: the
> > consumers receive large spikes of messages mixed with periods of
> > complete inactivity, and overall a long delay (about 2 minutes) between
> > messages being written and messages being read. Anyway... as always, I
> > greatly appreciate the help.
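A footnote for anyone who finds this thread in the archives: the "8
threads for this topic" setup discussed above is driven by the high-level
consumer's topic count map. A minimal sketch, assuming the 0.7-era Java
API (class, method, and property names from memory; the topic name and
ZooKeeper address are placeholders, so double-check against your version):

  import java.util.HashMap;
  import java.util.List;
  import java.util.Map;
  import java.util.Properties;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;

  import kafka.consumer.Consumer;
  import kafka.consumer.ConsumerConfig;
  import kafka.consumer.KafkaMessageStream;
  import kafka.javaapi.consumer.ConsumerConnector;
  import kafka.message.Message;

  public class ConsumerThreadsSketch {
    public static void main(String[] args) {
      Properties props = new Properties();
      props.put("zk.connect", "zk1:2181");   // 0.7 name; later versions use zookeeper.connect
      props.put("groupid", "my-group");      // later versions use group.id
      props.put("fetch.size", "83886080");   // 80 MB, matching the values above
      props.put("queuedchunks.max", "100000");

      ConsumerConnector connector =
          Consumer.createJavaConsumerConnector(new ConsumerConfig(props));

      // Ask for 8 streams on the topic -- one per worker thread.
      Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
      topicCountMap.put("my-topic", 8);
      List<KafkaMessageStream<Message>> streams =
          connector.createMessageStreams(topicCountMap).get("my-topic");

      // With 2 consumer processes x 8 streams each, the topic needs at
      // least 16 partitions for every thread to actually receive data;
      // streams beyond the partition count simply sit idle.
      ExecutorService pool = Executors.newFixedThreadPool(8);
      for (final KafkaMessageStream<Message> stream : streams) {
        pool.submit(new Runnable() {
          public void run() {
            for (Message message : stream) {
              // decode and process message.payload() here
            }
          }
        });
      }
      // (real code would also shut down the pool and connector cleanly)
    }
  }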
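And the lag check Jun suggested, for reference: the tool ships with the
Kafka distribution and is invoked with something like this (exact flag
names can vary between versions):

  bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
      --zkconnect zk1:2181 --group my-group

It reports, per partition, the latest offset in the log, the group's
consumed offset, and the lag between the two, which is what distinguishes
a slow consumer from slow producers.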