Thanks for helping out so far.
As per your explanation, we are doing exactly what you described in your workaround below.
Here is the problem...
We have a topic that receives a lot of events (around a million a day), so this topic has a high number of partitions on the server. We have dedicated consumers listening only to this topic, and the processing time is on the order of 15-30 ms, so we are confident that our consumers are not slow in processing.
Every now and then, our consumer threads stall and stop receiving events (as indicated in my previous email with the thread stacks of the idle threads), even though we can see the offset lag increasing for the consumers.
We also noticed that if we force a rebalance of the consumers (either by starting a new consumer or killing an existing one), data starts to flow again to these consumer threads. The consumers remain stable (processing events) for about 20-30 minutes before the threads go idle again and the backlog starts growing. This happens in a cycle for us, and we are not able to figure out why events stop flowing in.
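As a rough sanity check on the figures above, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: the helper name is made up for illustration, and it assumes the 15-30 ms is per event and that events arrive evenly over the day, which is not something we have measured.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def required_threads(events_per_day, millis_per_event):
    """Threads needed to keep up, assuming evenly spread arrivals (an assumption)."""
    arrival_rate = events_per_day / SECONDS_PER_DAY   # events per second
    per_thread_rate = 1000.0 / millis_per_event       # events per second, one thread
    return arrival_rate / per_thread_rate

# Worst case from the figures above: 1,000,000 events/day at 30 ms each.
worst_case = required_threads(1_000_000, 30)
print(f"threads needed at 30 ms: {worst_case:.2f}")
```

Under these assumptions even the slow end of the range needs well under one thread's worth of capacity, which is why we do not think processing speed is the issue.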
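The symptom described above (lag growing while the threads receive nothing) could be flagged by comparing two successive offset samples. The sketch below is hypothetical: `OffsetSample` and `is_stalled` are illustrative names, and how the offsets are obtained from the broker or ZooKeeper is deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class OffsetSample:
    consumed_offset: int   # last offset the consumer has processed
    log_end_offset: int    # latest offset available on the broker

    @property
    def lag(self):
        return self.log_end_offset - self.consumed_offset

def is_stalled(prev, curr):
    """True when lag grew between samples but the consumer made no progress."""
    return curr.lag > prev.lag and curr.consumed_offset == prev.consumed_offset

# Example: the log end moved from 150 to 400 while the consumer stayed at 100.
print(is_stalled(OffsetSample(100, 150), OffsetSample(100, 400)))
```

A check like this distinguishes a stalled consumer from one that is merely falling behind, since the latter still advances its consumed offset.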
As a side note, we are also monitoring the GC cycles and there are hardly any.
Please let us know if you need any additional details.
On 10-Jul-2013, at 8:30 PM, Jun Rao <[EMAIL PROTECTED]> wrote: