If this is going to be a CP then other CPs need an easy way to use the
output stats. If a subsequent proposal from core requires statistics from
this CP does that then mandate it itself must be a CP? What if that can't
Putting the stats into a table addresses the first concern.
For the second, it is an issue that comes up I think when building a
generally useful shared function as a CP. Please consider inserting my
earlier comments about OSGi here, in that we trend toward a real module
system if we're not careful (unless that is the aim).
On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED]>wrote:
> TL;DR Making it part of the UI and ensuring that you don't load things the
> wrong way seem to be the only reasons for making this part of core -
> certainly not bad reasons. They are fairly easy to handle as a CP though,
> so maybe its not necessary immediately.
> I ended up writing a simple stats framework last week (ok, its like 6
> classes) that makes it easy to create your own stats for a table. Its all
> coprocessor based, and as Lars suggested, hooks up to the major compactions
> to let you build per-column-per-region stats and writes it to a 'system'
> table = "_stats_".
> With the framework you could easily write your own custom stats, from
> simple things like min/max keys to things like fixed width or fixed depth
> histograms, or even more complicated. There has been some internal
> discussion around how to make this available to the community (as part of
> Phoenix, core in HBase, an independent github project, ...?).
> The biggest isssue around having it all CP based is that you need to be
> really careful to ensure that it comes _after_ all the other compaction
> coprocessors. This way you know exactly what keys come out and have correct
> statistics (for that point in time). Not a huge issue - you just need to be
> careful. Baking the stats framework into HBase is really nice in that we
> can be sure we never mess this up.
> Building it into the core of HBase isn't going to get us per-region
> statistics without a whole bunch of pain - compactions per store make this
> a pain to actualize; there isn't a real advantage here, as I'd like to keep
> it per CF, if only not to change all the things.
> Further, this would be a great first use-case for real system tables.
> Mixing this data with .META. is going to be a bit of a mess, especially for
> doing clean scans, etc. to read the stats. Also, I'd be gravely concerned
> to muck with such important state, especially if we make a 'statistic' a
> pluggable element (so people can easily expand their own).
> And sure, we could make it make pretty graphs on the UI, no harm in it and
> very little overhead :)
> Jesse Yates
> On Tue, Feb 26, 2013 at 2:08 PM, Stack <[EMAIL PROTECTED]> wrote:
> > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]>
> > > This topic comes up now and then (see recent discussion about
> > > multi Gets into Scan+Filter).
> > >
> > > It's not that hard to keep statistics as part of compactions.
> > > I envision two knobs:
> > > 1. Max number of distinct values to track directly. If a column has
> > > this # of values, keep track of their occurrences explicitly.
> > > 2. Number of (equal width) histogram partitions to maintain.
> > >
> > > Statistics would be kept per store (i.e. per region per column family)
> > and
> > > stored into an HBase table (one row per store).Initially we could just
> > > support major compactions that atomically insert a new version of that
> > > statistics for the store.
> > >
> > >
> > Sounds great.
> > In .META. add columns for each each cf on each region row? Or another
> > table?
> > What kind of stats would you keep? Would they be useful for operators?
> > just for stuff like say Phoenix making decisions?
> > > An simple implementation (not knowing ahead of time how many values it
Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)