Accumulo >> mail # dev >> GSOC: Monitor Improvements


Re: GSOC: Monitor Improvements
On Mon, Apr 22, 2013 at 11:42 AM, Mike Drob <[EMAIL PROTECTED]> wrote:

> Adding on to the comment about summaries, averages, and outliers. If, for
> some reason, you end up with a two-hump population, then simply showing
> averages will mask the split and lose a lot of valuable information. It is
> often valuable to know that a particular set of users or servers are
> experiencing degraded performance while the rest of the ecosystem is
> healthy.
>
> This isn't something that shows up in a regular time series because the
> secondary population is usually very small compared to the total
> population. There was a graph for request latency of a service that I saw
> once that I really wish I could find again, maybe somebody on the list will
> be able to chime in. It had timestamps on the x-axis, latency on the y,
> and each (x,y) point was colored on a gradient representing how many
> requests were fulfilled at time x with latency y. This chart made it
> immediately easy to see that most data points fit a normal distribution
> with a low mean, but there was also a cluster at the top for some reason.
>
That sounds really cool.  Maybe the y-axis/latency could be log scale.
Inevitably a 3004 second operation will finish and obscure the
smaller latencies.
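
As a rough illustration of the chart described above, here is a minimal
sketch in Python of how the counts behind such a heatmap could be built,
with latency bucketed on a log scale. Nothing here is Accumulo code: the
function names, the 60-second time bucket, and the four buckets per decade
are all assumptions made for the example.

```python
import math
from collections import Counter


def latency_heatmap(samples, time_bucket_s=60, buckets_per_decade=4):
    """Count requests per (time bucket, log-scale latency bucket).

    samples: iterable of (timestamp_seconds, latency_seconds) pairs.
    Returns a Counter keyed by (time_index, latency_index); the count in
    each cell is what the color gradient would represent.
    """
    counts = Counter()
    for ts, latency in samples:
        if latency <= 0:
            continue  # a log scale cannot represent zero or negative values
        t_idx = int(ts // time_bucket_s)
        # Equal-width buckets in log10 space: each decade of latency is
        # split into `buckets_per_decade` rows of the heatmap.
        l_idx = math.floor(math.log10(latency) * buckets_per_decade)
        counts[(t_idx, l_idx)] += 1
    return counts


def bucket_lower_bound(l_idx, buckets_per_decade=4):
    """Smallest latency (in seconds) that falls into latency bucket l_idx."""
    return 10 ** (l_idx / buckets_per_decade)
```

With log-scale rows, one operation that takes hundreds of seconds lands only
a handful of rows above the millisecond-range requests, so it shows up as its
own cell instead of stretching the axis and flattening everything else.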

Sometimes it's more useful to sample this type of info from the clients
rather than tablet servers.   A tablet server may report low latencies, but
all clients using it may experience high latencies because of a network issue.
  We could certainly consider making the client code report this info.
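
To make the two-hump point from the quoted message concrete, here is a
small Python sketch of what could be done with client-reported latencies:
compare the mean against percentiles, and flag the sources whose own median
is far above the overall median. The (source, latency_ms) sample format, the
function names, and the factor of 10 are assumptions for illustration only;
the client does not report anything like this today.

```python
import math
import statistics
from collections import defaultdict


def summarize(latencies_ms):
    """Return mean, median (p50), p99 and max using nearest-rank percentiles."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]

    return {
        "mean": statistics.fmean(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
        "max": ordered[-1],
    }


def slow_sources(samples, factor=10.0):
    """Return sources whose median latency exceeds factor * overall median.

    samples: list of (source, latency_ms) pairs, where source is a
    hypothetical client or tablet server identifier.
    """
    by_source = defaultdict(list)
    for source, latency_ms in samples:
        by_source[source].append(latency_ms)
    overall = statistics.median(lat for _, lat in samples)
    return sorted(
        source
        for source, lats in by_source.items()
        if statistics.median(lats) > factor * overall
    )
```

For a population where a few percent of requests come from one slow client,
the mean lands somewhere between the two humps and describes neither, while
the p50/p99 pair and the per-source medians make the split visible.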
>
> I'd love to see that type of chart show up for tablet servers (probably not
> as useful for tables).
>
> Mike
>
>
> On Mon, Apr 22, 2013 at 9:05 AM, Eric Newton <[EMAIL PROTECTED]> wrote:
>
> > Another thing to consider is scale.  On large clusters (many hundreds of
> > nodes), more data is not helpful for visualization.  Instead, summaries,
> > averages and outliers are important.
> >
> > For example, if one node is consistently slow, it is better to know that
> > than to see one graph with low numbers in a sea of graphs.
> >
> > If the monitor collects information using JMX, collection time for each
> > node would be a good thing to know, too.
> >
> > -Eric
> >
> >
> > On Sun, Apr 21, 2013 at 10:00 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> >
> > > Supun,
> > >
> > > Yup, very much so. Having a way to consume any and all metrics via JMX
> > > would simplify things for any consumers (internal or external).
> > >
> > >
> > >
> > > On 04/21/2013 02:15 PM, Supun Kamburugamuva wrote:
> > >
> > >> Hi Josh,
> > >>
> > >> Thanks for the suggestions. I'll incorporate these into the proposal.
> > >>
> > >> Another area I would like to work on is JMX. There is a Jira that says to
> > >> replace the Monitor calls from Thrift to JMX (ACCUMULO-694). Do you think
> > >> this is a good addition to the Monitor?
> > >>
> > >> Thanks,
> > >> Supun..
> > >>
> > >>
> > >> On Sun, Apr 21, 2013 at 1:45 PM, Josh Elser <[EMAIL PROTECTED]> wrote:
> > >>
> > >>  Supun,
> > >>>
> > >>> Looks good! Can I make some suggestions/comments?
> > >>>
> > >>> For: "Per table plots: ACCUMULO-594", I'd also like to see minor
> > >>> compactions, major compactions, index cache hit rate, and data cache hit
> > >>> rate per table (the same graphs that are displayed system-wide when you
> > >>> visit http://${MONITOR_HOST}:50095/).
> > >>>
> > >>> For "Per tablet [server] plots", it would be neat if you could also
> > >>> extract some general statistics like top N least performing, top N
> > >>> highest performing, etc. tablet servers. Ideally, this could correlate
> > >>> with servers that may be having problems :).
> > >>>
> > >>> Do you see these proposed changes as being sufficient for 3-4 months of
> > >>> 40hrs/week work? If you plan to really dig into these changes (perhaps
> > >>> reworking components of the monitor itself), I could perhaps see this. Do
> > >>> you have any ideas for more lofty goals that you could pursue as