We're seeing an issue running 0.7.0 where one or more of our consumers are pausing after about an hour when we have a lot of threads configured. Our setup is as follows:
* 4 brokers configured for 32 threads and 32 partitions on each broker.
* 2 consumers processing 40 streams in total (24 and 16).
* Zookeeper server is a CDH version that's at least 3.3.4.
We were also seeing this with 3 consumers running 18 threads each. As you can tell, the hardware is quite beefy and the brokers are described as being "bored."
Outside of upgrading to 0.7.2, which we are planning on doing but can't yet, what else can we look into to try to resolve this or at least determine what's happening?
Yes, we have. Our SA at the site where this is occurring has been monitoring it. When the consumers went down, we could see that things were lagging. Yesterday, they lowered the number of threads for the consumers to six each, and the consumers haven't shut down since. There still appears to be some lag, but since the consumers are running, it's decreasing.
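To make "lagging" and "decreasing" concrete, here is a rough sketch of how lag could be compared between two samples. It assumes lag is the broker's log end offset minus the consumer's committed offset for each partition. The function names, the "broker-partition" keys, and all offset values are made up for illustration and are not taken from our cluster.

```python
def lag_per_partition(log_end_offsets, consumed_offsets):
    """Lag for each partition: newest offset on the broker minus the consumed offset."""
    return {p: log_end_offsets[p] - consumed_offsets[p] for p in log_end_offsets}


def is_draining(earlier_lag, later_lag):
    """True if total lag shrank between the two samples."""
    return sum(later_lag.values()) < sum(earlier_lag.values())


# Hypothetical samples taken some minutes apart, keyed by "brokerId-partitionId".
earlier = lag_per_partition({"1-0": 5000, "1-1": 7000}, {"1-0": 3000, "1-1": 6500})
later = lag_per_partition({"1-0": 5200, "1-1": 7100}, {"1-0": 4700, "1-1": 7000})

print(earlier, later, is_draining(earlier, later))
```

A consumer that has paused would show the opposite pattern: its consumed offsets stop moving while the log end offsets keep growing.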
A test was run with each broker configured to have 32 partitions, and when the number of threads across the consumers exceeds 32, we have issues. My understanding from the documentation is that when you set the number of partitions on a broker, it's just for that broker, correct? Therefore, if we set each broker to have 32 partitions, across 4 brokers we should have 128 partitions per topic, correct? In which case, we should be able to run 128 consumer threads with ease.
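The arithmetic I have in mind can be sketched as below. This assumes the per-broker partition setting really is per broker, and that partitions are handed out to consumer threads in contiguous ranges. The range-style split is my assumption about how the rebalance behaves, not something I have confirmed against the 0.7.0 source, and the function names are hypothetical.

```python
def total_partitions(num_brokers, partitions_per_broker):
    """Partitions per topic if every broker hosts the same number."""
    return num_brokers * partitions_per_broker


def range_assign(num_partitions, num_threads):
    """Split partitions into contiguous ranges, one (start, count) pair per thread.

    The first (num_partitions % num_threads) threads get one extra partition.
    """
    base, extra = divmod(num_partitions, num_threads)
    assignment = []
    for i in range(num_threads):
        start = base * i + min(i, extra)
        count = base + (1 if i < extra else 0)
        assignment.append((start, count))
    return assignment


total = total_partitions(4, 32)   # 4 brokers x 32 partitions each
plan = range_assign(total, 40)    # 40 consumer threads (24 + 16)
idle = sum(1 for _, count in plan if count == 0)

print(total, plan[0], plan[-1], idle)   # prints: 128 (0, 4) (125, 3) 0
```

With 128 partitions, every one of the 40 threads gets 3 or 4 partitions. If only 32 partitions were visible to the consumer group, the same split would leave 8 of the 40 threads with nothing assigned, which lines up with the 32-thread threshold we are seeing but does not by itself explain a pause.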