I'm looking into better ways of estimating high-percentile latency,
since I don't think the existing sampling-based method does a good job. I
wrote up a document outlining what I think are the available options and
posted it on HBASE-6261; I'd encourage any resident stats experts and
anyone interested in metrics to take a look.
Off JIRA, though, I wanted to ask what would be useful in practice. I
think it'd be nice to see, for example, accurate 90th and 99th percentile
latency over the most recent 10s, 1m, 5m, and 15m time windows. I found
some nice algorithms that do this, though I think they cost MBs of memory.
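
To make "percentile over a recent time window" concrete, here is a minimal sketch of the naive exact approach: keep every timestamped sample, drop the ones older than the window, and sort what is left. This is not one of the algorithms from the HBASE-6261 write-up; the class name WindowedPercentile and its methods are hypothetical, and the nearest-rank percentile definition is an assumption. Memory grows with the number of samples in the window, which is the cost the cleverer algorithms try to bound.

```python
import math
from collections import deque


class WindowedPercentile:
    """Exact percentile over the last `window_seconds` of samples.

    Hypothetical helper for illustration only; it stores every sample.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, latency) pairs, oldest first

    def record(self, timestamp, latency):
        # Assumes timestamps arrive in non-decreasing order.
        self.samples.append((timestamp, latency))

    def percentile(self, now, q):
        # Evict samples that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self.samples and self.samples[0][0] <= cutoff:
            self.samples.popleft()
        if not self.samples:
            return None
        ordered = sorted(latency for _, latency in self.samples)
        # Nearest-rank: smallest value with at least a fraction q of
        # the samples at or below it.
        rank = max(1, math.ceil(q * len(ordered)))
        return ordered[rank - 1]
```

One instance per window (10s, 1m, 5m, 15m) would be enough for a quick experiment, at the price of storing each sample once per window.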
If this sounds like overkill, though, there are even cheaper algorithms
that give a more qualitative sense of how the latency distribution is
changing over time. You give up one or more of bounded error, exact
percentiles, or time-based windows, but you still get a general sense of
whether latency is going up or down.
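
As a rough illustration of the cheap end of that trade-off, the sketch below tracks a quantile with a single number, nudging it up or down on each sample (a simple stochastic-approximation update). It is an assumed example, not necessarily one of the options in the HBASE-6261 document; the name StreamingQuantile and the step parameter are hypothetical. It has no error bound, no exact percentiles, and no time window, but the estimate does drift in the same direction as the real quantile.

```python
class StreamingQuantile:
    """Constant-memory quantile tracker (illustrative sketch only).

    The estimate settles where a fraction q of samples fall at or
    below it, but with no guaranteed error bound and no time window.
    """

    def __init__(self, q, step=0.5, initial=0.0):
        self.q = q              # target quantile, e.g. 0.9 or 0.99
        self.step = step        # larger = faster reaction, more jitter
        self.estimate = initial

    def update(self, latency):
        if latency > self.estimate:
            # Sample above the estimate: move up, weighted by q.
            self.estimate += self.step * self.q
        else:
            # Sample at or below the estimate: move down, weighted by 1 - q.
            self.estimate -= self.step * (1.0 - self.q)
        return self.estimate
```

The step size sets how quickly the estimate reacts, which stands in (loosely) for a window length: there is no point at which old samples are actually forgotten.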
So, is the "full" solution compelling enough to proceed? Anything