I am currently running a deployment with 3 brokers, 3 ZK, 3 producers, 2 consumers, and 15 topics. I should first point out that this is my first project using Kafka ;). The issue I'm seeing is that the consumers are only processing about 15 messages per second from what should be the largest topic they consume (we're sending 200-400 ~300-byte messages per second to this topic). I should note that I'm using the high-level ZK consumer and ZK 3.4.3.
I have a strong feeling I have not configured things properly so I could definitely use some guidance. Here is my broker configuration:
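(The configuration itself doesn't appear to have survived in this archive. Purely as an illustration, a Kafka 0.7-era server.properties touching the settings discussed later in this thread might look like the fragment below; the values are placeholders, not the poster's actual config.)

```properties
# server.properties (Kafka 0.7 property names; illustrative values only)
brokerid=1
port=9092
log.dir=/var/kafka/logs
num.partitions=1
log.flush.interval=10000
log.default.flush.interval.ms=1000
zk.connect=zk1:2181,zk2:2181,zk3:2181
```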
Thanks very much for the reply Neha! So I swapped out the consumer that processes the messages with one that just prints them. It does indeed achieve a much better rate at peaks, but it can still drop to nearly (if not completely) zero at times. I plotted the messages printed in graphite to show the behaviour I'm seeing (this is messages printed per second):
The peaks are over ten thousand per second and the troughs can go below 10 per second just prior to another peak. I know that there are plenty of messages available because the ones currently being processed are still from Friday afternoon, so this may or may not have something to do with this pattern.
Is there anything I can do to avoid the periods of lower performance? Ideally I would be processing messages as soon as they are written. On Sun, Apr 21, 2013 at 8:49 AM, Neha Narkhede <[EMAIL PROTECTED]>wrote:
Hmm it is highly unlikely that that is the culprit... There is lots of bandwidth available for me to use. I will definitely keep that in mind though. I was working on this today and have some tidbits of additional information and thoughts that you might be able to shed some light on:
- I mentioned I have 2 consumers, but each consumer is running with 8 threads for this topic (and each consumer has 8 cores available).
- When I initially asked for help the brokers were configured with num.partitions=1. I've since tried higher values (3, 64) and haven't seen much of an improvement, aside from forcing both consumer apps to handle messages (with overall performance not changing much).
- I ran into this article http://riccomini.name/posts/kafka/2012-10-05-kafka-consumer-memory-tuning/ and tried a variety of queuedchunks.max and fetch.size settings with no significant results (meaning they did not achieve the goal of constantly processing hundreds or thousands of messages per second, which is roughly the input rate). I would not be surprised if I'm wrong, but this made me start to think that the problem may lie outside the consumers.
- Would the combination of a high number of partitions (64) and a high log.flush.interval (10k messages) prevent logs from flushing as often as they need to for my desired rate of consumption (even with log.default.flush.interval.ms=1000)?
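To put rough numbers on the flush question above: with the figures quoted in this thread (roughly 300 msgs/sec spread over 64 partitions, and a count-based flush threshold of 10k messages per partition), the count threshold would essentially never be reached, so the time-based flush (log.default.flush.interval.ms=1000) always fires first. A back-of-the-envelope check:

```python
# Back-of-the-envelope check: with the rates discussed in this thread,
# does the count-based flush (log.flush.interval=10000 messages) ever
# trigger before the time-based one (log.default.flush.interval.ms=1000)?

msgs_per_sec = 300             # approximate topic input rate from the thread
partitions = 64                # num.partitions value tried above
flush_interval_msgs = 10_000   # log.flush.interval (messages per partition)

per_partition_rate = msgs_per_sec / partitions             # ~4.7 msgs/sec/partition
secs_to_count_flush = flush_interval_msgs / per_partition_rate

print(f"{per_partition_rate:.1f} msgs/sec per partition")
print(f"{secs_to_count_flush / 60:.0f} minutes to reach the 10k-message threshold")
# At ~35 minutes per count-based flush, the 1-second time-based flush
# fires first every time, so the count-based setting is effectively inert.
```

So at these rates the count-based interval is irrelevant; the 1-second time-based flush governs latency.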
Despite the changes I mentioned, the behaviour is still the same: the consumers receive large spikes of messages mixed with periods of complete inactivity, and overall there is a long delay (about 2 minutes) between messages being written and messages being read. Anyway... as always I greatly appreciate any help.
On Sun, Apr 21, 2013 at 8:50 PM, Jun Rao <[EMAIL PROTECTED]> wrote:
Oh... and at this point I'm talking about consumers that do no processing and don't even produce any output. They simply send UDP packets to graphite. On Mon, Apr 22, 2013 at 9:13 PM, Andrew Neilson <[EMAIL PROTECTED]>wrote:
Thanks Jun, your suggestion helped me quite a bit.
Since earlier this week I've been able to work out the issues (at least it seems like it for now). My consumer is now roughly processing messages at the rate they are being produced with an acceptable amount of lag end to end. Here is an overview of the issues I had. Let me know if the way I resolved things makes sense:
- Many serialization errors in the producers. Fixing these eliminated what were previously perceived as lost or delayed messages.
- One of the producers was not accessible through the VIP we were sending messages to. There was also a bug in the healthcheck that caused the NetScaler to drop one of the producers. Both of these contributed to sending too many messages to one producer, which filled up its blocking queues.
- I had to increase queue.size on the producers several times (currently at 320k). This may now be unnecessarily high given my next point.
- Increased batch.size on the producers several times. The last increase (batch.size=1600) is what finally got things going at a rate I am happy with.
- Decreased num.partitions and log.flush.interval on the brokers from 64/10k to 32/100 in order to lower the average flush time (we were previously always hitting the default time-based flush interval, since no partition ever accumulated 10k messages). Flush times are currently < 100ms (not sure if this is too low, but everything seems to be working); the average flush time was previously 1 second.
- Increased fetch.size and queuedchunks.max on the consumers several times and ended at 80MB/100k. This was before I made a bunch of the changes on the producer side, so these may be unnecessarily high as well.
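For reference, the final settings listed above, collected into property fragments (Kafka 0.7-era property names as used throughout this thread; only the quoted values are from the thread, everything else is left at defaults):

```properties
# producer.properties (async producer)
queue.size=320000
batch.size=1600

# consumer.properties (high-level ZK consumer)
fetch.size=83886080        # ~80MB
queuedchunks.max=100000

# server.properties (broker)
num.partitions=32
log.flush.interval=100
log.default.flush.interval.ms=1000
```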
Once again, thanks for all of the help. I'm curious to know which, if any, of the changes I made were unnecessary.
Andrew On Tue, Apr 23, 2013 at 7:53 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
The only other thing being written to these disks is log4j (kafka.out), so technically it is not dedicated to the data logs. The disks are 250GB SATA. On Fri, Apr 26, 2013 at 6:35 PM, Neha Narkhede <[EMAIL PROTECTED]>wrote:
Andrew Neilson 2013-04-27, 01:50