Re: Simple statistics per region
TL;DR Making it part of the UI and ensuring that you don't load things the
wrong way seem to be the only reasons for making this part of core -
certainly not bad reasons. They are fairly easy to handle as a CP though,
so maybe it's not necessary immediately.

I ended up writing a simple stats framework last week (ok, it's like 6
classes) that makes it easy to create your own stats for a table. It's all
coprocessor-based and, as Lars suggested, hooks into major compactions to
let you build per-column-per-region stats, which it writes to a 'system'
table, "_stats_".

With the framework you could easily write your own custom stats, from
simple things like min/max keys to things like fixed width or fixed depth
histograms, or even more complicated. There has been some internal
discussion around how to make this available to the community (as part of
Phoenix, core in HBase, an independent github project, ...?).
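
To give a sense of the plug-in point, a statistic basically boils down to
an interface along these lines (the names here are made up, not necessarily
what the framework uses):

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;

/**
 * Hypothetical plug-in point: a statistic sees every KeyValue that survives
 * the major compaction of one store, then says how to persist itself.
 */
public interface StatisticTracker {
  /** Called for each KeyValue the compaction emits for this store. */
  void updateStatistic(KeyValue kv);

  /** Fold the finished statistic into the Put written to "_stats_". */
  void addToPut(Put put);

  /** Reset between compactions. */
  void clear();
}

A min/max key tracker or an equal-width histogram is then just an
implementation of that interface.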

The biggest issue with having it all CP-based is that you need to be
really careful to ensure that it comes _after_ all the other compaction
coprocessors. This way you know exactly which keys come out and have correct
statistics (for that point in time). Not a huge issue - you just need to be
careful. Baking the stats framework into HBase is really nice in that we
can be sure we never mess this up.
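
Concretely, the "be careful" part today means giving the stats observer a
larger priority number than everything else on the table, so it sits last
in the chain (hypothetical class names below; as far as I know, lower
priority values are invoked earlier):

import java.io.IOException;

import org.apache.hadoop.hbase.Coprocessor;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateStatsEnabledTable {
  public static void main(String[] args) throws IOException {
    HTableDescriptor desc = new HTableDescriptor("my_table");
    desc.addFamily(new HColumnDescriptor("fam"));

    // Any other compaction coprocessors the table needs, at the usual priority.
    desc.addCoprocessor("com.example.SomeOtherCompactionObserver", null,
        Coprocessor.PRIORITY_USER, null);

    // The stats observer gets a larger priority number so it runs after the
    // others and sees exactly the KeyValues that actually survive compaction.
    desc.addCoprocessor("com.example.StatsCompactionObserver", null,
        Coprocessor.PRIORITY_USER + 1, null);

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    try {
      admin.createTable(desc);
    } finally {
      admin.close();
    }
  }
}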

Building it into the core of HBase isn't going to get us per-region
statistics without a whole bunch of pain - since compactions run per store,
aggregating across the whole region is awkward to actualize. There isn't a
real advantage there anyway; I'd like to keep it per CF, if only to avoid
changing all the things.

Further, this would be a great first use-case for real system tables.
Mixing this data with .META. is going to be a bit of a mess, especially for
doing clean scans, etc. to read the stats. Also, I'd be gravely concerned
to muck with such important state, especially if we make a 'statistic' a
pluggable element (so people can easily add their own).

And sure, we could have it draw pretty graphs in the UI - no harm in it and
very little overhead :)

-------------------
Jesse Yates
@jesse_yates
jyates.github.com
On Tue, Feb 26, 2013 at 2:08 PM, Stack <[EMAIL PROTECTED]> wrote:

> On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
>
> > This topic comes up now and then (see recent discussion about translating
> > multi Gets into Scan+Filter).
> >
> > It's not that hard to keep statistics as part of compactions.
> > I envision two knobs:
> > 1. Max number of distinct values to track directly. If a column has fewer
> > than this # of values, keep track of their occurrences explicitly.
> > 2. Number of (equal width) histogram partitions to maintain.
> >
> > Statistics would be kept per store (i.e. per region per column family) and
> > stored into an HBase table (one row per store). Initially we could just
> > support major compactions that atomically insert a new version of the
> > statistics for the store.
> >
> >
> Sounds great.
>
> In .META. add columns for each cf on each region row?  Or another
> table?
>
> What kind of stats would you keep?  Would they be useful for operators?  Or
> just for stuff like say Phoenix making decisions?
>
>
>
> > A simple implementation (not knowing ahead of time how many values it
> > will see during the compaction) could start by keeping track of individual
> > values for columns. If it gets past the max # of distinct values to track,
> > start with equal width histograms (using the distinct values picked up so
> > far to estimate an initial partition width).
> > If the number of partitions gets larger than what was configured, it would
> > increase the width and merge the previous counts into the new width (which
> > means the new partition width must be a multiple of the previous size).
> > There's probably a lot of other fanciness that could be used here (haven't
> > spent a lot of time thinking about details).
> >
> >
> > Is this something that should be in core HBase or rather be implemented
> > as coprocessors?
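
(For what it's worth, the adaptive scheme Lars describes above - exact counts
up to a limit, then equal-width buckets merged into wider ones - could look
roughly like this for numeric values; the class name and the long-only value
type are made up, just to make the mechanics concrete:)

import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/**
 * Knob 1: track up to maxDistinct values exactly.
 * Knob 2: once past that, fall back to an equal-width histogram with at most
 * maxPartitions buckets; when a value would push the bucket count over the
 * limit, double the bucket width and fold the old counts into the new buckets
 * (the new width is a multiple of the old one, so buckets merge cleanly).
 */
public class AdaptiveHistogram {
  private final int maxDistinct;
  private final int maxPartitions;

  private Map<Long, Long> exact = new HashMap<Long, Long>();
  private TreeMap<Long, Long> buckets;  // bucket start -> count
  private long width = -1;              // partition width; unused while exact

  public AdaptiveHistogram(int maxDistinct, int maxPartitions) {
    this.maxDistinct = maxDistinct;
    this.maxPartitions = maxPartitions;
  }

  public void add(long value) {
    if (exact != null) {
      Long c = exact.get(value);
      exact.put(value, c == null ? 1L : c + 1);
      if (exact.size() <= maxDistinct) {
        return;
      }
      switchToHistogram();  // moves the exact counts (incl. this value) over
    } else {
      addToBucket(value, 1L);
    }
    while (buckets.size() > maxPartitions) {
      widen();
    }
  }

  /** Use the distinct values seen so far to estimate an initial width. */
  private void switchToHistogram() {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (long v : exact.keySet()) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    width = Math.max(1, (max - min) / maxPartitions + 1);
    buckets = new TreeMap<Long, Long>();
    for (Map.Entry<Long, Long> e : exact.entrySet()) {
      addToBucket(e.getKey(), e.getValue());
    }
    exact = null;
  }

  private void addToBucket(long value, long count) {
    // Floor the value to the start of its bucket (works for negatives too).
    long start = value - (((value % width) + width) % width);
    Long c = buckets.get(start);
    buckets.put(start, c == null ? count : c + count);
  }

  /** New width is a multiple (2x) of the old one; old buckets fold cleanly. */
  private void widen() {
    width *= 2;
    TreeMap<Long, Long> old = buckets;
    buckets = new TreeMap<Long, Long>();
    for (Map.Entry<Long, Long> e : old.entrySet()) {
      addToBucket(e.getKey(), e.getValue());
    }
  }
}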