Kafka, Work Distribution, and Work Stealing
We use Kafka to distribute batches work across several boxes. These batches
may take anywhere from 1 hour to 24 hours to complete. Currently, we have N
partitions, each allocated to one of N consumer worker boxes.
We find that as the batch nears completion, with only M <
N partitions still containing work, our throughput suffers as now only M
workers are doing the work. This problem is amplified if several of our N
boxes are slower than the others, as our throughput is now tied to our
slowest box. It would be great if workers that were done with their work
could share the load.
I am wondering if anyone has dealt with this issue. It doesn't seem that
Kafka has a solution for this out of the box (though I could be wrong), but
maybe others have clever solutions, best practices, or suggestions for
other technologies to add to the mix.
I am especially curious how the Kafka philosophy applies to this situation.