HBase, mail # dev - Simple stastics per region


lars hofhansl 2013-02-23, 06:40
Andrew Purtell 2013-02-23, 17:41
lars hofhansl 2013-02-23, 20:39
Andrew Purtell 2013-02-23, 17:18
Stack 2013-02-26, 22:08
Jesse Yates 2013-02-26, 22:31
Re: Simple stastics per region
Andrew Purtell 2013-02-26, 23:27
If this is going to be a CP then other CPs need an easy way to use the
output stats. And if a subsequent proposal from core requires statistics
from this CP, does that then mandate that the proposal must itself be a CP?
What if that can't work?

Putting the stats into a table addresses the first concern.

For the second, I think it is an issue that comes up whenever a generally
useful shared function is built as a CP. Please consider my earlier
comments about OSGi inserted here: we trend toward a real module
system if we're not careful (unless that is the aim).
On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:

> TL;DR Making it part of the UI and ensuring that you don't load things the
> wrong way seem to be the only reasons for making this part of core -
> certainly not bad reasons. They are fairly easy to handle as a CP though,
> so maybe it's not necessary immediately.
>
> I ended up writing a simple stats framework last week (ok, it's like 6
> classes) that makes it easy to create your own stats for a table. It's all
> coprocessor based and, as Lars suggested, hooks into major compactions
> to let you build per-column-per-region stats and write them to a 'system'
> table, "_stats_".
>
> With the framework you could easily write your own custom stats, from
> simple things like min/max keys to things like fixed width or fixed depth
> histograms, or even more complicated. There has been some internal
> discussion around how to make this available to the community (as part of
> Phoenix, core in HBase, an independent github project, ...?).
>
> The biggest issue around having it all CP based is that you need to be
> really careful to ensure that it comes _after_ all the other compaction
> coprocessors. This way you know exactly what keys come out and have correct
> statistics (for that point in time). Not a huge issue - you just need to be
> careful. Baking the stats framework into HBase is really nice in that we
> can be sure we never mess this up.
>
> Building it into the core of HBase isn't going to get us per-region
> statistics without a whole bunch of pain - compactions per store make this
> a pain to actualize; there isn't a real advantage here, as I'd like to keep
> it per CF, if only not to change all the things.
>
> Further, this would be a great first use-case for real system tables.
> Mixing this data with .META. is going to be a bit of a mess, especially for
> doing clean scans, etc. to read the stats. Also, I'd be gravely concerned
> to muck with such important state, especially if we make a 'statistic' a
> pluggable element (so people can easily expand their own).
>
> And sure, we could make it make pretty graphs on the UI, no harm in it and
> very little overhead :)
>
> -------------------
> Jesse Yates
> @jesse_yates
> jyates.github.com
>
>
> On Tue, Feb 26, 2013 at 2:08 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> >
> > > This topic comes up now and then (see recent discussion about
> > > translating multi Gets into Scan+Filter).
> > >
> > > It's not that hard to keep statistics as part of compactions.
> > > I envision two knobs:
> > > 1. Max number of distinct values to track directly. If a column has
> > > fewer than this # of values, keep track of their occurrences explicitly.
> > > 2. Number of (equal width) histogram partitions to maintain.
> > >
> > > Statistics would be kept per store (i.e. per region per column family)
> > > and stored into an HBase table (one row per store). Initially we could
> > > just support major compactions that atomically insert a new version of
> > > the statistics for the store.
> > >
> > >
> > Sounds great.
> >
> > In .META. add columns for each cf on each region row?  Or another
> > table?
> >
> > What kind of stats would you keep?  Would they be useful for operators?
> > Or just for stuff like say Phoenix making decisions?
> >
> >
> >
> > > An simple implementation (not knowing ahead of time how many values it

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)
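
Editor's sketch: the two knobs Lars proposes in the quoted text (a cap on
distinct values tracked exactly, and a number of equal-width histogram
partitions) can be illustrated with a few lines of Python. This is only a
minimal sketch under stated assumptions, not code from HBase or from Jesse's
framework: the function name build_column_stats, its parameters, and the
returned dict layout are all hypothetical; column values are assumed to be
numeric; the values are assumed to fit in memory so they can be scanned
twice; and treating "exactly max_distinct values" as still trackable is a
boundary choice made here, not something the thread specifies.

```python
from collections import Counter


def build_column_stats(values, max_distinct=3, num_buckets=4):
    """Summarize one column of one store (hypothetical helper).

    Knob 1 (max_distinct): if the column has at most this many distinct
    values, keep an exact occurrence count per value.
    Knob 2 (num_buckets): otherwise fall back to an equal-width histogram
    with this many partitions between the observed min and max.
    """
    values = list(values)  # assumption: small enough to scan twice
    if not values:
        return {"kind": "empty"}

    # First pass: count occurrences until the distinct-value cap is exceeded.
    counts = Counter()
    overflow = False
    for v in values:
        if v in counts or len(counts) < max_distinct:
            counts[v] += 1
        else:
            overflow = True
            break
    if not overflow:
        return {"kind": "exact", "counts": dict(counts)}

    # Second pass: equal-width histogram over [min, max].
    lo, hi = min(values), max(values)
    buckets = [0] * num_buckets
    if hi == lo:
        # Degenerate range (only possible when max_distinct is 0).
        buckets[0] = len(values)
    else:
        width = (hi - lo) / num_buckets
        for v in values:
            # Clamp so the maximum value lands in the last bucket.
            idx = min(int((v - lo) / width), num_buckets - 1)
            buckets[idx] += 1
    return {"kind": "histogram", "min": lo, "max": hi, "buckets": buckets}


if __name__ == "__main__":
    print(build_column_stats([1, 1, 2]))
    print(build_column_stats([0, 1, 2, 3, 4, 5, 6, 8]))
```

In the design discussed in the thread, a summary like this would be computed
while a major compaction rewrites a store and then written as one row per
store into a stats table; a real implementation would have to stream values
rather than hold them in a list.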
Enis Söztutar 2013-02-27, 00:15
lars hofhansl 2013-02-27, 00:27
Jesse Yates 2013-02-27, 00:31
Jesse Yates 2013-02-28, 01:52