We're considering adding Kafka monitoring to SPM (see http://sematext.com/spm/index.html ). We use Kafka ourselves and would like to see our Kafka metrics in SPM along with Hadoop, HBase, Solr, and other metrics we monitor.
My questions:
* What do people currently use for monitoring Kafka?
* Is there appetite for adding Kafka support to SPM?
* How different is the structure/content of JMX for 0.7 vs. 0.8? Is 0.8 a backwards-compatible superset of 0.7?
I think there are really two angles to look at this from...
1) What is 'important' to monitor? That is, which subset of the metrics is important/critical for being able to tell system health (things you want to set alerts on), which subset is nice to have for overall health and capacity planning (things you want to create pretty graphs of), and which falls into the rest (not immediately useful in general, but really helpful in a debugging/triage situation)?
2) How do you get the data? Kind of independent of the above, though kinda related as well.
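To make the first angle concrete, here is a minimal sketch of sorting metrics into the three buckets described above (alert on it, graph it, keep it for triage). The prefixes and metric names are hypothetical placeholders, not real Kafka bean names. Which metrics belong in which bucket is exactly the open question.

```python
from enum import Enum


class Tier(Enum):
    ALERT = "alert"  # critical for system health; set alerts on these
    GRAPH = "graph"  # overall health and capacity planning; graph these
    DEBUG = "debug"  # everything else; kept for debugging/triage


# Hypothetical prefix rules. These names are placeholders for illustration,
# not actual Kafka JMX bean names.
TIER_RULES = [
    ("example.health.", Tier.ALERT),
    ("example.throughput.", Tier.GRAPH),
]


def classify(metric_name: str) -> Tier:
    """Return the tier of the first matching prefix, defaulting to DEBUG."""
    for prefix, tier in TIER_RULES:
        if metric_name.startswith(prefix):
            return tier
    return Tier.DEBUG
```

Defaulting unknown metrics to the triage bucket means nothing gets dropped, while the alerting set stays small and deliberate.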
As for the second one, you need to look at the collection mechanics. As you mentioned below, large-scale polling (especially with a non-trivial number of beans) is expensive and problematic no matter how you do it (JMX or HTTP) given enough scale. I don't have much experience with the codahale metrics route directly, but I have messed with Jolokia, which is likely in the same boat: both expose the metrics for you to grab. In both cases, given enough data points (and Kafka, depending on the number of topics involved, has a /lot/ of them), either can be slow if not implemented carefully, meaning a single poll may take longer than your desired polling interval.
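One cheap way to see whether you are overrunning the polling interval is to time each poll and compare it to the interval. A minimal sketch, where `fetch` is a hypothetical stand-in for whatever actually reads the beans (JMX, Jolokia, or anything else), not a real API:

```python
import time
from typing import Callable, Dict, Tuple


def timed_poll(
    fetch: Callable[[], Dict[str, float]],
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Dict[str, float], float, bool]:
    """Run one poll and report how long it took.

    `fetch` is a hypothetical callable that returns {metric_name: value}.
    Returns (metrics, elapsed_seconds, overran), where `overran` is True
    when the poll took longer than the desired polling interval.
    """
    start = clock()
    metrics = fetch()
    elapsed = clock() - start
    return metrics, elapsed, elapsed > interval_s
```

Tracking `elapsed` as a metric of its own gives you early warning as the number of topics (and therefore beans) grows.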
In very large environments, I've found it very scalable to have either a local poller on the box (which could be reading via JMX or HTTP) that then emits the data to something, or some kind of wrapper around the application that does the collection/emission (launching the broker as a thread, while the parent process does some JMX magic to connect to the data points). Both of these routes depend a lot on your monitoring infrastructure, but they will help you get around the general wide polling problem...
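The local poller route can be sketched roughly as below. Everything here is an assumption for illustration: `read_metrics` stands in for the local JMX or HTTP read, `emit` stands in for whatever your monitoring infrastructure accepts, and the `host.metric value timestamp` line format is arbitrary.

```python
import time
from typing import Callable, Dict, List


def format_lines(host: str, metrics: Dict[str, float], timestamp: int) -> List[str]:
    """Render metrics as 'host.metric value timestamp' lines (illustrative format)."""
    return [f"{host}.{name} {value} {timestamp}" for name, value in sorted(metrics.items())]


def run_local_poller(
    read_metrics: Callable[[], Dict[str, float]],  # hypothetical local JMX/HTTP reader
    emit: Callable[[List[str]], None],             # hypothetical push to the collector
    host: str,
    interval_s: float,
    cycles: int,
    sleep: Callable[[float], None] = time.sleep,
    now: Callable[[], float] = time.time,
) -> None:
    """Poll locally and push the results, instead of being polled remotely."""
    for _ in range(cycles):
        started = now()
        emit(format_lines(host, read_metrics(), int(started)))
        # Sleep only for what is left of the interval; skip the sleep if the
        # poll itself already took longer than the interval.
        remaining = interval_s - (now() - started)
        if remaining > 0:
            sleep(remaining)
```

Because each box only reads its own broker and pushes the result, the cost of walking a large number of beans stays local and no central poller has to fan out across the whole fleet.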