We are starting to use Kafka in production, but we found an unexpected (at
least for me) behavior in the way partitions are assigned. We have a bunch
of topics with a few partitions each, and we try to consume all the data
with several consumers in a single consumer group.

The problem is in the rebalance step. The rebalance splits the partitions
per topic between all consumers. So if you have 100 topics with only 2
partitions each and 10 consumers, only two consumers will be used. That is,
for each topic, all its partitions are listed and shared among the
consumers in the consumer group in order (not randomly), so the same
consumers at the head of the list win the partitions of every topic.
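To make the imbalance concrete, here is a small sketch (not Kafka's actual code; the function names and structure are my own) contrasting the per-topic assignment described above with a global round-robin over all (topic, partition) pairs:

```python
import math

def per_topic_range(num_topics, parts_per_topic, consumer_ids):
    """Per-topic assignment (the behavior described above): each topic's
    partition list is split across the sorted consumer list independently,
    so the same consumers at the head of the list win every topic."""
    consumers = sorted(consumer_ids)
    assignment = {c: [] for c in consumers}
    # Partitions each consumer takes per topic (at least the first ones).
    per = math.ceil(parts_per_topic / len(consumers))
    for t in range(num_topics):
        for p in range(parts_per_topic):
            assignment[consumers[p // per]].append((t, p))
    return assignment

def global_round_robin(num_topics, parts_per_topic, consumer_ids):
    """Alternative: flatten all (topic, partition) pairs from all topics
    and deal them round-robin across the group, spreading the load."""
    consumers = sorted(consumer_ids)
    assignment = {c: [] for c in consumers}
    all_parts = [(t, p) for t in range(num_topics)
                 for p in range(parts_per_topic)]
    for i, tp in enumerate(all_parts):
        assignment[consumers[i % len(consumers)]].append(tp)
    return assignment

consumers = ["c%d" % i for i in range(10)]

# 100 topics x 2 partitions, 10 consumers: per-topic assignment leaves
# 8 consumers completely idle...
a = per_topic_range(100, 2, consumers)
print(sorted(c for c in consumers if a[c]))   # only c0 and c1 get work

# ...while a global round-robin gives every consumer 20 partitions.
b = global_round_robin(100, 2, consumers)
print([len(b[c]) for c in consumers])
```

This is exactly the asymmetry I am asking about: the second scheme seems strictly better when topics outnumber consumers.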

This behavior is also described in Algorithm 1 of the original Kafka
paper [1].

I don't understand this decision. Why is the split done per topic?
Wouldn't it make more sense to divide all partitions from all topics
among all the consumers in the consumer group? I don't see the reason
for this, so I would like to hear your opinions before changing the code.

We are using Kafka 0.7.1.

Thank you in advance


[1] "Kafka: a Distributed Messaging System for Log Processing", Jay Kreps,
Neha Narkhede and Jun Rao.
