HBase, mail # dev - Simple stastics per region

Re: Simple stastics per region
lars hofhansl 2013-02-27, 00:27
Just had a discussion with the Phoenix folks (my cubicle neighbors :) ).
It turns out that the kinds of problems we're trying to solve for Phoenix need equal-depth histograms, whereas decisions such as picking a secondary index often use equal-width histograms.
So a key piece of this is a proper framework through which stats can be hooked up and calculated. OSGi for coprocessors would be nice, but may also be overkill for this.
Maybe something like the chores framework would work.
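
To make the distinction between the two histogram types concrete, here is a minimal Python sketch. It is not taken from Phoenix or HBase; the function names and the sample data are made up for illustration, and it assumes numeric values rather than byte-array row keys. Equal-width buckets split the value range evenly, while equal-depth buckets are chosen so that each holds roughly the same number of values.

```python
from bisect import bisect_right


def equal_width_edges(values, buckets):
    """Bucket boundaries that split [min, max] into equally wide intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / buckets
    return [lo + i * width for i in range(buckets)] + [hi]


def equal_depth_edges(values, buckets):
    """Bucket boundaries chosen so each bucket holds about the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    # The i-th boundary is the value found i/buckets of the way through the sorted data.
    return [ordered[(i * n) // buckets] for i in range(buckets)] + [ordered[-1]]


def bucket_counts(values, edges):
    """Count values per bucket; buckets are [lo, hi) except the last, which includes hi."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        idx = min(bisect_right(edges, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts


# Hypothetical, heavily skewed sample data.
values = [1, 2, 3, 4, 5, 6, 7, 100]
print(bucket_counts(values, equal_width_edges(values, 4)))
print(bucket_counts(values, equal_depth_edges(values, 4)))
```

On skewed data like this, the equal-width buckets pile almost everything into one bucket, while the equal-depth buckets stay balanced.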

In either case, there will be core stats (that would allow HBase to decide between a scan and a multi-get), and user-defined stats to help higher layers such as Phoenix, or an indexing library.
-- Lars
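
As a rough illustration of how core stats might feed the scan versus multi-get decision mentioned above, here is a Python sketch. Everything in it is an assumption made for the example: the function names, the idea of mapping keys to numbers, the uniform-within-bucket estimate, and the cost constants. None of it reflects actual HBase behavior.

```python
def estimate_rows(edges, counts, lo, hi):
    """Estimate how many rows fall in [lo, hi) from a histogram.

    Assumes keys are numeric and spread uniformly inside each bucket.
    """
    total = 0.0
    for i, count in enumerate(counts):
        b_lo, b_hi = edges[i], edges[i + 1]
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        if b_hi > b_lo:
            total += count * overlap / (b_hi - b_lo)
    return total


def choose_access_path(num_keys, est_rows, get_cost=10.0, scan_cost_per_row=1.0):
    """Pick the cheaper plan under a made-up linear cost model."""
    if num_keys * get_cost < est_rows * scan_cost_per_row:
        return "multi-get"
    return "scan"


# Hypothetical histogram: three buckets of 100 rows each over keys 0..30.
edges = [0, 10, 20, 30]
counts = [100, 100, 100]
rows_in_range = estimate_rows(edges, counts, 5, 25)
print(choose_access_path(3, rows_in_range))
print(choose_access_path(50, rows_in_range))
```

The point is only that a per-region histogram gives the planner a row-count estimate to compare against the number of requested keys; a real cost model would need measured constants.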

From: Enis Söztutar
Sent: Tuesday, February 26, 2013 4:15 PM
Subject: Re: Simple stastics per region
+1 for core. I can see that histograms might help us in automatic splits
and merges as well.
Andrew Purtell

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it must itself be a CP? What if that
> can't work?
> Putting the stats into a table addresses the first concern.
> For the second, it is an issue that comes up I think when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
> > TL;DR Making it part of the UI and ensuring that you don't load things
> > the wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP though,
> > so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (ok, it's like 6
> > classes) that makes it easy to create your own stats for a table. It's
> > all coprocessor based and, as Lars suggested, hooks into major
> > compactions to let you build per-column-per-region stats, writing them
> > to a 'system' table named "_stats_".
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to things like fixed-width or fixed-depth
> > histograms, or even more complicated. There has been some internal
> > discussion around how to make this available to the community (as part of
> > Phoenix, core in HBase, an independent github project, ...?).
> >
> > The biggest issue around having it all CP based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. This way you know exactly what keys come out and have
> > correct statistics (for that point in time). Not a huge issue - you just
> > need to be careful. Baking the stats framework into HBase is really nice
> > in that we can be sure we never mess this up.
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - compactions per store make
> > this a pain to actualize; there isn't a real advantage here, as I'd like
> > to keep it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use-case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially
> > for doing clean scans, etc. to read the stats. Also, I'd be gravely
> > concerned to muck with such important state, especially if we make a
> > 'statistic' a pluggable element (so people can easily expand their own).
> >
> > And sure, we could make it make pretty graphs on the UI, no harm in it
> > and very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
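
The shape of the framework Jesse describes (pluggable statistics computed over the cells that come out of a major compaction, stored per region and column family) might be sketched as below. This is a Python illustration only, not the actual framework or the HBase coprocessor API; the class names, the `update`/`result` methods, and the dictionary standing in for the "_stats_" table are all hypothetical.

```python
class MinMaxKey:
    """Hypothetical statistic: smallest and largest row key seen."""
    name = "min_max_key"

    def __init__(self):
        self.min_key = None
        self.max_key = None

    def update(self, row_key, value):
        if self.min_key is None or row_key < self.min_key:
            self.min_key = row_key
        if self.max_key is None or row_key > self.max_key:
            self.max_key = row_key

    def result(self):
        return {"min": self.min_key, "max": self.max_key}


class CellCount:
    """Hypothetical statistic: number of cells seen."""
    name = "cell_count"

    def __init__(self):
        self.count = 0

    def update(self, row_key, value):
        self.count += 1

    def result(self):
        return {"count": self.count}


def collect_stats(region, cells, stat_factories):
    """Run every statistic over the cells that survive a compaction.

    `cells` is an iterable of (row_key, column_family, value) tuples.
    Returns a dict keyed by (region, column family, stat name), standing
    in for rows that would be written to a stats table.
    """
    trackers = {}
    for row_key, family, value in cells:
        if family not in trackers:
            # One fresh tracker per statistic per column family.
            trackers[family] = [make() for make in stat_factories]
        for tracker in trackers[family]:
            tracker.update(row_key, value)
    return {
        (region, family, tracker.name): tracker.result()
        for family, family_trackers in trackers.items()
        for tracker in family_trackers
    }


# Hypothetical post-compaction cells for one region.
cells = [(b"a", "cf1", b"1"), (b"c", "cf1", b"2"), (b"b", "cf2", b"3")]
stats = collect_stats("region-1", cells, [MinMaxKey, CellCount])
print(stats[("region-1", "cf1", "min_max_key")])
```

Adding a new statistic here means writing one more small class with `name`, `update`, and `result`, which is the kind of pluggability the message describes. The ordering concern raised in the message still applies: the cells fed in must be the final output after all other compaction hooks have run.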