It looks like consumer throughput suffers because of an imbalance of data
across partitions. When you say the batch nears completion, it seems the
number of partitions that still have new data shrinks, leaving fewer
consumer instances to process a large amount of data. Is that correct?
In Kafka, having more partitions for a topic increases I/O parallelism,
since writes can proceed in parallel. At the same time, it lets you scale
consumption across a cluster of machines, since a partition is the smallest
unit of consumer parallelism. For your use case, what is worth looking into
is a work distribution strategy that shards data evenly over the available
Kafka partitions.
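As a rough illustration of that idea (this is a sketch of the general
technique, not a Kafka API; the key format and partition count are made up):
if the batch is split into many small work items and each item is hashed to
a partition, the load stays roughly even across all partitions until the
very end, so no consumer sits idle while others are still grinding.

```python
import hashlib

def partition_for(work_item_key: str, num_partitions: int) -> int:
    """Map a work item to a partition by hashing its key.

    Splitting a large batch into many small work items and spreading
    them across all partitions keeps every consumer busy until the
    batch is nearly done, instead of tying throughput to the few
    partitions that still hold data.
    """
    digest = hashlib.md5(work_item_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# Hypothetical example: 1000 small work items spread over 8 partitions.
counts = [0] * 8
for i in range(1000):
    counts[partition_for(f"batch-42/item-{i}", 8)] += 1

print(counts)  # roughly even across all 8 partitions
```

The same hashing could be done on the producer side with a custom
partitioner, so the number of partitions (and hence the parallelism)
stays independent of how large any single batch is.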
On Wed, Jan 16, 2013 at 9:47 AM, David Ross <[EMAIL PROTECTED]> wrote:
> We use Kafka to distribute batches of work across several boxes. These
> batches may take anywhere from 1 hour to 24 hours to complete. Currently,
> we have N partitions, each allocated to one of N consumer worker boxes.
> We find that as the batch nears completion, with only M <
> N partitions still containing work, our throughput suffers as now only M
> workers are doing the work. This problem is amplified if several of our N
> boxes are slower than the others, as our throughput is now tied to our
> slowest box. It would be great if workers that were done with their work
> could share the load.
> I am wondering if anyone has dealt with this issue. It doesn't seem that
> Kafka has a solution for this out of the box (though I could be wrong), but
> maybe others have clever solutions, best practices, or suggestions for
> other technologies to add to the mix.
> I am especially curious how the Kafka philosophy applies to this situation.