It looks like consumer throughput suffers because of an imbalance of data
across partitions. When you say the batch nears completion, it seems like
the number of partitions with new data shrinks, leaving fewer consumer
instances to process large amounts of data. Is that true?

In Kafka, having more partitions for a topic increases I/O parallelism by
allowing writes to proceed in parallel. At the same time, it lets you scale
consumption across a cluster of machines, since a partition is the smallest
unit of consumer parallelism. For your use case, what is worth looking into
is a work-distribution strategy that shards data over the available Kafka
partitions somewhat evenly.
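As a rough illustration (not Kafka's built-in partitioner, just a sketch of
one possible sharding strategy): hash keyed messages so the same key always
lands on the same partition, and fall back to round-robin for unkeyed
messages so no single partition starves near the end of a batch. The class
and method names here are made up for the example.

```java
// Sketch of an even-sharding strategy across a fixed number of partitions.
// Keyed messages hash to a stable partition; unkeyed messages round-robin,
// which avoids skew when keys are absent or poorly distributed.
public class EvenSharder {
    private final int numPartitions;
    private int next = 0; // round-robin cursor for unkeyed messages

    public EvenSharder(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    public int partitionFor(String key) {
        if (key == null) {
            int p = next;
            next = (next + 1) % numPartitions;
            return p;
        }
        // Mask off the sign bit so the modulo result is non-negative.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

With Kafka's producer API you would plug a strategy like this into the
point where the producer picks a partition for each message, so that writes
stay spread across all partitions throughout the batch rather than
collapsing onto a few near the end.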


On Wed, Jan 16, 2013 at 9:47 AM, David Ross <[EMAIL PROTECTED]> wrote: