We are starting to use Kafka in production but we found an unexpected (at least for me) behavior with the use of partitions. We have a bunch of topics with a few partitions each. We try to consume all data from several consumers (just one consumer group).
The problem is in the rebalance step. The rebalance splits the partitions per topic among all consumers. So if you have 100 topics with only 2 partitions each and 10 consumers, only two consumers will be used. That is, for each topic, all partitions are listed and shared among the consumers in the consumer group in order (not randomly).
This behavior is also described in Algorithm 1 of the original Kafka paper.
I don't understand this decision. Why split by topic? Wouldn't it make more sense to divide all partitions from all topics among all the consumers in the consumer group? I don't see the reason for the current behavior, so I would like to hear your opinion before changing the code.
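To make the issue concrete, here is a rough Python sketch of the per-topic range assignment described above (illustrative only, not Kafka's actual code), with 100 topics of 2 partitions each and 10 consumers:

```python
# Sketch of a per-topic range assignment (illustrative, not Kafka's real code).
# For each topic, partitions are listed in order and handed out to the sorted
# consumer list, so when a topic has fewer partitions than there are consumers,
# the same leading consumers get all the work for every topic.

def range_assign_per_topic(topics, partitions_per_topic, consumers):
    assignment = {c: [] for c in consumers}
    for topic in topics:
        partitions = list(range(partitions_per_topic))
        per_consumer = len(partitions) // len(consumers)
        extra = len(partitions) % len(consumers)
        start = 0
        for i, consumer in enumerate(sorted(consumers)):
            count = per_consumer + (1 if i < extra else 0)
            assignment[consumer].extend(
                (topic, p) for p in partitions[start:start + count])
            start += count
    return assignment

topics = [f"topic-{i}" for i in range(100)]
consumers = [f"consumer-{i}" for i in range(10)]
assignment = range_assign_per_topic(topics, 2, consumers)
busy = [c for c in sorted(consumers) if assignment[c]]
print(busy)  # only the first two consumers ever get partitions
```

With 2 partitions per topic, each topic's assignment loop gives one partition to the first consumer and one to the second; the other eight consumers get nothing for any topic.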
Currently, a partition is the smallest unit by which we distribute data among consumers (in the same consumer group). So, if the number of consumers is larger than the total number of partitions in a Kafka cluster (across all brokers), some consumers will never get any data. That decision is made on a per-topic basis. If a consumer consumes multiple topics, it would make sense to divide the partitions across all topics among the consumers. We haven't done that yet. Part of the reason is that we need to figure out how to balance the data across topics, since they can be of different sizes. We can look into that post 0.8.
For now, the solution is to increase the number of partitions on the broker.
On Mon, Jan 7, 2013 at 9:03 AM, Pablo Barrera González <[EMAIL PROTECTED]> wrote:
That is a good suggestion. Ideally, the partitions across all topics should be distributed evenly across consumer streams instead of on a per-topic basis. There is no particular advantage to the current scheme of per-topic rebalancing that I can think of. Would you mind filing a JIRA to track this improvement?
Thanks,
Neha

On Mon, Jan 7, 2013 at 9:10 AM, Jun Rao <[EMAIL PROTECTED]> wrote:
I was trying to avoid adding more partitions. I have enough partitions if you count all partitions across all topics. I understand the problem of different data loads per topic, but the current scheme does not solve this problem either, so we wouldn't be worse off if we considered all partitions from all topics at the same time.
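For comparison, here is a hypothetical sketch (my own illustration, not Kafka's implementation) of the cross-topic split being proposed: pool every (topic, partition) pair and deal them out round-robin, so the same 200 partitions spread over all 10 consumers:

```python
from itertools import cycle

# Hypothetical cross-topic round-robin assignment: collect every
# (topic, partition) pair from all topics into one pool and deal them
# out to the sorted consumer list like cards.
def round_robin_assign(topics, partitions_per_topic, consumers):
    assignment = {c: [] for c in consumers}
    consumer_cycle = cycle(sorted(consumers))
    for topic in sorted(topics):
        for p in range(partitions_per_topic):
            assignment[next(consumer_cycle)].append((topic, p))
    return assignment

topics = [f"topic-{i}" for i in range(100)]
consumers = [f"consumer-{i}" for i in range(10)]
assignment = round_robin_assign(topics, 2, consumers)
print(sorted(len(v) for v in assignment.values()))  # every consumer gets 20
```

Note this simple version still ignores per-topic data-volume differences, as Jun pointed out, but no consumer is left idle.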
(From http://kafka.apache.org/design.html) One potential benefit of the existing rebalancing logic is to reduce the number of connections to brokers per consumer instance. However, if you have a large number of partitions and few brokers and/or consumer instances, then it wouldn't really help; so I agree it would be good to implement KAFKA-687. KAFKA-564 <https://issues.apache.org/jira/browse/KAFKA-564> may also be related - i.e., it may be easier to implement along with/after KAFKA-687.
Joel

On Mon, Jan 7, 2013 at 10:44 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
Joel Koshy 2013-01-08, 20:08