Currently, the partition is the smallest unit by which we distribute data among
consumers in the same consumer group. So, if the number of consumers is larger
than the number of partitions of a topic (counted across all brokers), some
consumers will never get any data for that topic. The assignment is done
on a per-topic basis. If a consumer consumes multiple topics, it would make
sense to divide the partitions of all topics among the consumers. We haven't done
that yet. Part of the reason is that we need to figure out how to balance
the data across topics, since they can be of different sizes. We can look
into that post 0.8.
For now, the workaround is to increase the number of partitions per topic on the brokers.
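
To make the effect concrete, here is a minimal sketch of the two schemes discussed in this thread. It is not the actual Kafka rebalance code; the function names, topic names and consumer ids are made up for illustration, and the cross-topic variant is only one possible way to do it (Kafka does not implement it as of this thread). It assumes every topic's partitions are numbered from 0 and that consumers are ordered by id.

```python
def assign_per_topic(topics, consumers):
    """Range-assign each topic's partitions independently (hypothetical sketch)."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    for topic, num_partitions in sorted(topics.items()):
        # Each consumer gets `base` partitions; the first `extra` get one more.
        base, extra = divmod(num_partitions, len(ordered))
        start = 0
        for i, consumer in enumerate(ordered):
            count = base + (1 if i < extra else 0)
            assignment[consumer].extend((topic, p) for p in range(start, start + count))
            start += count
    return assignment


def assign_across_topics(topics, consumers):
    """Round-robin all partitions of all topics (hypothetical alternative)."""
    ordered = sorted(consumers)
    assignment = {c: [] for c in ordered}
    all_partitions = [(t, p) for t, n in sorted(topics.items()) for p in range(n)]
    for i, tp in enumerate(all_partitions):
        assignment[ordered[i % len(ordered)]].append(tp)
    return assignment


# The scenario from the original message: 100 topics, 2 partitions each, 10 consumers.
topics = {f"topic-{i:03d}": 2 for i in range(100)}
consumers = [f"consumer-{i:02d}" for i in range(10)]

per_topic = assign_per_topic(topics, consumers)
across = assign_across_topics(topics, consumers)

busy_per_topic = sorted(c for c, parts in per_topic.items() if parts)
busy_across = sorted(c for c, parts in across.items() if parts)

print(len(busy_per_topic), "of", len(consumers), "consumers busy (per topic)")
print(len(busy_across), "of", len(consumers), "consumers busy (across topics)")
```

With the per-topic scheme the same two consumers (the first two in sort order) receive a partition of every topic and the other eight stay idle. The round-robin variant spreads the 200 partitions evenly by count, but it ignores the differing topic sizes mentioned above, which is the open balancing question.
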
On Mon, Jan 7, 2013 at 9:03 AM, Pablo Barrera González <
[EMAIL PROTECTED]> wrote:
> We are starting to use Kafka in production, but we found an unexpected (at
> least for me) behavior in the use of partitions. We have a bunch of
> topics with a few partitions each. We try to consume all the data with several
> consumers (just one consumer group).
> The problem is in the rebalance step. The rebalance splits the partitions
> of each topic among all consumers. So if you have 100 topics with only 2
> partitions each and 10 consumers, only two consumers will be used. That is,
> for each topic, all partitions are listed and shared among the
> consumers in the consumer group in order (not randomly).
> This behavior is also described in Algorithm 1 of the original Kafka paper (cited below).
> I don't understand this decision. Why is it split by topic? Wouldn't it make sense
> to divide all partitions from all topics among all the consumers in the
> consumer group? I don't see the reason for this, so I would like to hear your
> opinion before changing the code.
> We are using kafka 0.7.1.
> Thank you in advance
> Reference: "Kafka: a Distributed Messaging System for Log Processing",
> Jay Kreps, Neha Narkhede and Jun Rao.