Subject: Socket timeouts fetching metadata
We are seeing some odd socket timeouts from one of our producers. This
producer fans out data from one topic into dozens or hundreds of potential
output topics. We batch the sends to write 1,000 messages at a time.
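To make the fan-out pattern concrete, here is a rough sketch of the kind of batching described above. It is not our actual producer code: `route` and `send_batch` are hypothetical stand-ins for the topic-selection logic and the producer's batched send call, and it assumes the 1,000-message batch is counted across all output topics rather than per topic.

```python
from collections import defaultdict

BATCH_SIZE = 1000  # batch size mentioned above


def fan_out(messages, route, send_batch, batch_size=BATCH_SIZE):
    """Group messages by output topic and hand them off in batches.

    route(msg) -> output topic name (hypothetical routing function)
    send_batch(by_topic) -> performs the actual send (hypothetical callable)
    """
    pending = defaultdict(list)  # output topic -> messages awaiting send
    count = 0
    for msg in messages:
        pending[route(msg)].append(msg)
        count += 1
        if count >= batch_size:
            send_batch(dict(pending))  # one send may span many topics
            pending.clear()
            count = 0
    if count:
        send_batch(dict(pending))  # flush the final partial batch
```

Under that assumption, a single batch can touch many output topics at once, so the producer may need metadata for all of them before the send can complete.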
The timeouts are happening in the socket read, so I
assume that the socket.timeout.ms value applies, which we leave at the
default of 30 seconds. The odd thing is that these exceptions are
thrown in clusters of 3-5 at a time, with just a few seconds or less in
between each. We are running with 64 network threads in our brokers, which
seems like plenty given that the broker has only 8 cores. From the clustering
of timeouts, it looks as though we may be issuing multiple metadata
requests in parallel. Is that true?
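For anyone who wants to check the clustering against their own logs, this is a minimal sketch of how exception timestamps could be grouped. The function name and the 5-second gap threshold are illustrative assumptions, not something taken from Kafka or from our tooling; the timestamps would have to be parsed out of the log lines separately.

```python
def cluster_timestamps(timestamps, max_gap_s=5.0):
    """Group timestamps (in seconds) into clusters.

    A timestamp joins the current cluster when it is at most max_gap_s
    after the previous timestamp; otherwise it starts a new cluster.
    """
    clusters = []
    for ts in sorted(timestamps):
        if clusters and ts - clusters[-1][-1] <= max_gap_s:
            clusters[-1].append(ts)
        else:
            clusters.append([ts])
    return clusters
```

Clusters of 3-5 entries separated by long quiet periods would be consistent with several requests timing out together rather than independently.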
We haven't touched the I/O threads (still set at 2), but I'm wondering if
these are just artifacts of congestion in the communication between the
brokers and our clients. Are we using too many distinct topics (~95), and
should we try to cut down on them as a way to smooth the message exchanges
between broker and client? We expect the number of topics in production
to be much higher than this.
It does appear that the producer in this case is able to continue sending,
but these exceptions in the logs make our testers unhappy.
I won't include the very lengthy log messages in toto, but the stack traces