Kafka, mail # user - Re: Kafka, Work Distribution, and Work Stealing - 2013-01-16, 18:14
David Ross 2013-01-16, 17:51
Re: Kafka, Work Distribution, and Work Stealing
David,

It looks like consumer throughput suffers because data is unevenly
distributed across partitions. When you say the batch nears completion, it
sounds like fewer partitions still have new data, so fewer consumer
instances end up processing a large amount of data. Is that true?

In Kafka, having more partitions for a topic increases I/O parallelism,
since writes can go to the partitions in parallel. It also lets you scale
consumption over a cluster of machines, since a partition is the smallest
unit of consumer parallelism. For your use case, it is worth looking into a
work distribution strategy that shards data somewhat evenly over the
available Kafka partitions.
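
To make the point about even sharding concrete, here is a rough sketch in
plain Python. It does not use any Kafka client API; the partition count, the
key names, and the helper functions are all made up for illustration. It
compares how a skewed batch lands on partitions when you partition by key
versus when you simply cycle through the partitions.

```python
import hashlib
from collections import Counter
from itertools import count

NUM_PARTITIONS = 8  # hypothetical partition count for the topic


def partition_by_key(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (md5, not the salted hash())."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def make_round_robin(num_partitions: int = NUM_PARTITIONS):
    """Return a function that cycles through partitions, ignoring the key."""
    counter = count()
    return lambda: next(counter) % num_partitions


def load_per_partition(assignments, num_partitions: int = NUM_PARTITIONS):
    """Count messages per partition, including partitions that got none."""
    counts = Counter(assignments)
    return [counts.get(p, 0) for p in range(num_partitions)]


# A skewed batch: one hypothetical "hot" key dominates the work items.
keys = ["hot-customer"] * 900 + [f"customer-{i}" for i in range(100)]

keyed = load_per_partition(partition_by_key(k) for k in keys)
next_partition = make_round_robin()
spread = load_per_partition(next_partition() for _ in keys)

print("keyed by customer:", keyed)
print("round robin:      ", spread)
```

With key-based partitioning, every message for the hot key goes to the same
partition, so one consumer ends up with most of the batch. Cycling through
the partitions evens out the load, but you give up per-key ordering, because
messages for the same key are no longer confined to a single partition.
Whether that trade-off is acceptable depends on your workload.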

Thanks,
Neha

On Wed, Jan 16, 2013 at 9:47 AM, David Ross <[EMAIL PROTECTED]> wrote:
 