Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Kafka, mail # user - question about threaded vs. serial consumers

Copy link to this message
Re: question about threaded vs. serial consumers
Joel Koshy 2012-07-09, 21:45
Hi Jim,

If you use a single machine and configure your consumer with this
topic-count: "topic:N" you will get behavior (a). If N < available
partitions some threads will be allocated more partitions than the others.
If you use more than one machine (or JVMs) and have each of those consumers
configured with topic:X, each of your consumer instances will get that many
active threads (i.e., if there are enough partitions to share).

For (c): if a partition is not getting written to then yes the thread that
is allocated to it will block (until consumer.timeout.ms) - and it needs to
since new data may come to that partition at any time. BTW, if you are
asking about a single topic then this question would only arise if you are
using a custom partitioning scheme on the producer side since the default
partitioner (which is random) (should) ensure that all partitions would get
data eventually. However, if you are consuming multiple topics then it is
quite possible for some topics to not receive data (if their producers

Are there are options?  Is there a way to set up a non-blocking poll
> of a KafkaStream to see whether or not it has a message available?

You can use the offset shell to check the current available offsets
periodically - but that is generally unnecessary. The typical set up is to
just allocate a certain number of threads for consumption - if any of the
partitions don't have any more data then they just stay idle. Also, the
broker has a retention setting (defaults to a week I think) - so partitions
that don't receive data would expire eventually and trigger a rebalance on
the consumers.