HBase >> mail # dev >> Simple statistics per region


Re: Simple statistics per region
+1 for core. I can see that histograms might help us in automatic splits
and merges as well.
On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it must itself be a CP? What if that
> can't work?
>
> Putting the stats into a table addresses the first concern.
>
> For the second, it is an issue that comes up I think when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
>
>
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
>
> > TL;DR Making it part of the UI and ensuring that you don't load things
> > the wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP though,
> > so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (ok, it's like 6
> > classes) that makes it easy to create your own stats for a table. It's all
> > coprocessor based and, as Lars suggested, hooks into major compactions
> > to let you build per-column-per-region stats, writing them to a 'system'
> > table named "_stats_".
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to things like fixed width or fixed depth
> > histograms, or even more complicated. There has been some internal
> > discussion around how to make this available to the community (as part of
> > Phoenix, core in HBase, an independent github project, ...?).
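
A minimal sketch of what a pluggable "statistic" along these lines could look like, written in Python for brevity. Every name here (`Statistic`, `MinMaxKey`, `collect_stats`) is hypothetical and is not taken from the framework described above or from HBase; the list of cells merely stands in for the output of a major compaction, and the returned dict stands in for one row of a stats table.

```python
from abc import ABC, abstractmethod


class Statistic(ABC):
    """Hypothetical plug-in interface; illustrative only, not an HBase API."""

    name: str

    @abstractmethod
    def update(self, row_key: bytes, value: bytes) -> None:
        """Observe one cell emitted by the compaction."""

    @abstractmethod
    def result(self) -> dict:
        """Return the finished statistic as a plain dict."""


class MinMaxKey(Statistic):
    """Tracks the smallest and largest row key seen."""

    name = "min_max_key"

    def __init__(self):
        self.min_key = None
        self.max_key = None

    def update(self, row_key, value):
        if self.min_key is None or row_key < self.min_key:
            self.min_key = row_key
        if self.max_key is None or row_key > self.max_key:
            self.max_key = row_key

    def result(self):
        return {"min": self.min_key, "max": self.max_key}


def collect_stats(compacted_cells, statistics):
    """Feed every compacted cell to every statistic, then gather the results."""
    for row_key, value in compacted_cells:
        for stat in statistics:
            stat.update(row_key, value)
    return {stat.name: stat.result() for stat in statistics}


# Stand-in for the cells written by one major compaction of one store.
cells = [(b"row-b", b"1"), (b"row-a", b"2"), (b"row-c", b"3")]
stats_row = collect_stats(cells, [MinMaxKey()])
```

Adding a histogram or any other custom statistic would then mean writing one more `Statistic` subclass and passing it in the list.
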
> >
> > The biggest issue around having it all CP based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. This way you know exactly what keys come out and have
> > correct statistics (for that point in time). Not a huge issue - you just
> > need to be careful. Baking the stats framework into HBase is really nice
> > in that we can be sure we never mess this up.
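
A toy illustration of the ordering concern, assuming a compaction is modelled as a chain of hooks that each receive and return a list of cells. The hook names are made up for this sketch and the chain is not how HBase invokes coprocessors; it only shows that a stats hook running before a filtering hook counts cells that never reach the compacted file.

```python
def drop_temporary(cells):
    """Stand-in for another compaction hook that removes some cells."""
    return [(key, value) for key, value in cells if not key.startswith(b"tmp-")]


def make_counter(seen):
    """Build a stats hook that records how many cells it was shown."""
    def count_cells(cells):
        seen.append(len(cells))
        return cells
    return count_cells


def run_compaction(cells, hooks):
    """Pass the cells through each hook in order, as a simple pipeline."""
    for hook in hooks:
        cells = hook(cells)
    return cells


cells = [(b"tmp-1", b"x"), (b"row-a", b"y"), (b"row-b", b"z")]

early, late = [], []
out_early = run_compaction(cells, [make_counter(early), drop_temporary])
out_late = run_compaction(cells, [drop_temporary, make_counter(late)])
```

Both runs write the same two cells, but only the run with the counter placed last records a count that matches what was actually written.
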
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - compactions per store make
> > this a pain to actualize; there isn't a real advantage here, as I'd like
> > to keep it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use-case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially
> > for doing clean scans, etc. to read the stats. Also, I'd be gravely
> > concerned to muck with such important state, especially if we make a
> > 'statistic' a pluggable element (so people can easily expand their own).
> >
> > And sure, we could make it make pretty graphs on the UI, no harm in it
> > and very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> >
> > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[EMAIL PROTECTED]> wrote:
> >
> > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> > >
> > > > This topic comes up now and then (see recent discussion about
> > > > translating multi Gets into Scan+Filter).
> > > >
> > > > It's not that hard to keep statistics as part of compactions.
> > > > I envision two knobs:
> > > > 1. Max number of distinct values to track directly. If a column has
> > > > fewer than this # of values, keep track of their occurrences explicitly.
> > > > 2. Number of (equal width) histogram partitions to maintain.
> > > >
> > > > Statistics would be kept per store (i.e. per region per column family)
> > > > and stored into an HBase table (one row per store). Initially we could
> > > > just support major compactions that atomically insert a new version of