So my understanding is that the number of active streams a consumer can utilize equals the number of partitions in the topic. This is fine if we consume from a specific topic. But if we consume from a TopicFilter, I thought the consumer should be able to utilize (number of topics that match the filter * number of partitions per topic). But it looks like the number of streams the consumer can use is limited to just the number of partitions in a single topic, even though it's consuming from multiple topics.
Here is what I observed with 1 mirrormaker consuming from whitelist '.+'.
The whitelist matches 5 topics and each topic has 8 partitions. I used the consumer offset checker to look at the owner of each topic/partition.
1) When I started mirrormaker with num.streams=8, all topics/partitions were distributed between 8 consumer threads.
2) When I started mirrormaker with num.streams=16, 16 consumer threads were created, but only 8 show up as active owners in the consumer offset checker, and all topics/partitions were distributed between those 8 consumer threads.
So this could be a bottleneck for consumers: even though we partitioned the topic, a consumer using a topic filter can't get much parallelism out of num.streams. Am I missing something? Is there a way to make consumers/mirrormakers utilize more active streams? Thanks, Raja.
This is because currently the consumer rebalancing process of assigning partitions to consumer streams works at a per-topic level. Unless you have at least one topic with 16 partitions, the remaining 8 threads will not do any work. This is not ideal, and we want to look into a better rebalancing algorithm. However, it is a big change, and we'd prefer to do it as part of the consumer client rewrite.
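To make this concrete, here is a rough simulation (not Kafka code; the function and thread names are made up for illustration) of the per-topic range assignment described above, with 5 topics of 8 partitions and 16 threads:

```python
# Sketch of per-topic range assignment: for EACH topic independently,
# its partitions are split among ALL consumer threads (sorted), so any
# threads beyond a topic's partition count never receive work.

def assign_per_topic(topics, partitions_per_topic, threads):
    """Assign each topic's partitions across the full sorted thread list."""
    assignment = {th: [] for th in threads}
    for topic in topics:
        n = len(threads)
        per = partitions_per_topic // n
        extra = partitions_per_topic % n
        i = 0
        for idx, thread in enumerate(threads):
            count = per + (1 if idx < extra else 0)
            for p in range(i, i + count):
                assignment[thread].append((topic, p))
            i += count
    return assignment

threads = [f"thread-{i:02d}" for i in range(16)]
result = assign_per_topic([f"topic{i}" for i in range(5)], 8, threads)
active = [th for th, owned in result.items() if owned]
print(len(active))  # 8 -- the same leading 8 threads win every topic
```

Because the split restarts from the first thread for every topic, the same 8 threads own partitions of all 5 topics while the other 8 stay idle, which matches what the offset checker showed.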
Thanks, Neha On Thu, Aug 29, 2013 at 8:03 AM, Rajasekar Elango <[EMAIL PROTECTED]> wrote:
So what is the best way to load balance multiple consumers consuming from a topic filter?
Let's say we have 4 topics with 8 partitions and 2 consumers.
Option 1) To load balance the consumers, we can set num.streams=4 so that both consumers split the 8 partitions of each topic, but we can only use half of the possible consumer streams.
Option 2) Configure mutually exclusive topic filter regexes such that 2 topics match consumer1 and 2 topics match consumer2. Now we can set num.streams=8 and fully utilize the consumer streams. I believe this will improve performance, but if a consumer dies, we will not get any data from the topics assigned to that consumer.
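For what it's worth, the active-stream counts of the two options can be tallied with a small back-of-the-envelope calculation (a sketch assuming the per-topic range assignment; the helper name is made up):

```python
# Under per-topic range assignment, at most `partitions_per_topic`
# threads in a group get work, no matter how many threads exist.

def active_streams(partitions_per_topic, total_threads):
    return min(total_threads, partitions_per_topic)

# Option 1: one shared group, 2 consumers x num.streams=4 = 8 threads,
# topics have 8 partitions each -> all 8 threads active.
option1 = active_streams(8, 2 * 4)

# Option 2: disjoint whitelists, so each consumer rebalances alone with
# num.streams=8 over its 2 topics -> 8 active threads per consumer.
option2 = 2 * active_streams(8, 8)

print(option1, option2)  # 8 vs 16 active streams
```

So option 2 doubles the active streams, at the cost of losing the topics owned by a consumer that dies, as noted above.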
What would be your recommendation?
Thanks, Raja. On Thu, Aug 29, 2013 at 12:42 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote: Thanks, Raja.
If you read my previous posts, based on the current rebalancing logic, if we consume from a topic filter, the consumer can't actively use all streams. Can you provide your recommendation on option 1 vs option 2 from my previous post?
Thanks, Raja. On Thu, Aug 29, 2013 at 11:42 PM, Jun Rao <[EMAIL PROTECTED]> wrote: Thanks, Raja.
Yeah. The actual bottleneck is the number of topics that match the topic filter. The number of streams is going to be shared among all topics it's consuming from. I thought about the following ideas to work around this (I am basically referring to the mirrormaker consumer in these examples).
Option 1) Instead of running one mirrormaker process with topic filter ".+", we can start multiple mirrormaker processes, each with a topic filter matching one topic family (e.g., mirrormaker1 => whitelist topic1.*, mirrormaker2 => whitelist topic2.*, etc.).
But this adds some operational overhead to start and manage multiple processes on the host.
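As a rough illustration, option 1 might look something like this (a sketch only; the exact MirrorMaker flags, class path, and config file names depend on the Kafka version and are placeholders here):

```shell
# Sketch: one MirrorMaker process per topic family instead of one '.+' filter.
# Config files and topic regexes below are placeholders for your setup.

bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist 'topic1.*' \
  --num.streams 8 &

bin/kafka-run-class.sh kafka.tools.MirrorMaker \
  --consumer.config consumer.properties \
  --producer.config producer.properties \
  --whitelist 'topic2.*' \
  --num.streams 8 &
```

Each process rebalances independently, so each topic family gets its own full set of streams, at the cost of supervising several processes.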
Option 2) Modify the mirrormaker code to support a list of whitelist filters and create message streams for each filter (call createMessageStreamsByFilter once per filter).
What would be your recommendation? If adding this feature to mirrormaker is worthwhile for Kafka, we can do option 2.
Thanks, Raja. On Fri, Aug 30, 2013 at 10:34 AM, Jun Rao <[EMAIL PROTECTED]> wrote: Thanks, Raja.
It seems to me option 1) is easier. Option 2) has the same issue as option 1) since you still have to manage different whitelists.
A more general solution is probably to change the consumer distribution model to divide partitions across topics. That way, one can create as many streams as the total # of partitions across all topics. We can look into that in the future.
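A cross-topic distribution of the kind described above could be sketched like this (hypothetical code, not Kafka's implementation; names are made up): pool all (topic, partition) pairs and deal them round-robin over every thread.

```python
# Sketch of cross-topic assignment: all (topic, partition) pairs are
# pooled and dealt round-robin over the threads, so up to
# total-partition-count streams can be active at once.

from itertools import cycle

def assign_round_robin(topics, partitions_per_topic, threads):
    all_partitions = [(t, p) for t in topics
                      for p in range(partitions_per_topic)]
    assignment = {th: [] for th in threads}
    for tp, th in zip(all_partitions, cycle(threads)):
        assignment[th].append(tp)
    return assignment

threads = [f"thread-{i:02d}" for i in range(16)]
result = assign_round_robin([f"topic{i}" for i in range(5)], 8, threads)
active = [th for th, owned in result.items() if owned]
print(len(active))  # 16 -- every thread owns some of the 40 partitions
```

With 5 topics x 8 partitions = 40 partitions, all 16 threads from the original num.streams=16 experiment would get work under this model.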
Jun On Fri, Aug 30, 2013 at 8:24 AM, Rajasekar Elango <[EMAIL PROTECTED]> wrote: