HBase >> mail # dev >> Simple stastics per region


Re: Simple stastics per region
+1 for core. I can see that histograms might help us in automatic splits
and merges as well.
On Tue, Feb 26, 2013 at 3:27 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> If this is going to be a CP then other CPs need an easy way to use the
> output stats. If a subsequent proposal from core requires statistics from
> this CP, does that then mandate that it must itself be a CP? What if that
> can't work?
>
> Putting the stats into a table addresses the first concern.
>
> For the second, it is an issue that comes up I think when building a
> generally useful shared function as a CP. Please consider inserting my
> earlier comments about OSGi here, in that we trend toward a real module
> system if we're not careful (unless that is the aim).
>
>
> On Tue, Feb 26, 2013 at 2:31 PM, Jesse Yates <[EMAIL PROTECTED]> wrote:
>
> > TL;DR Making it part of the UI and ensuring that you don't load things the
> > wrong way seem to be the only reasons for making this part of core -
> > certainly not bad reasons. They are fairly easy to handle as a CP though,
> > so maybe it's not necessary immediately.
> >
> > I ended up writing a simple stats framework last week (ok, it's like 6
> > classes) that makes it easy to create your own stats for a table. It's all
> > coprocessor based, and as Lars suggested, hooks up to the major compactions
> > to let you build per-column-per-region stats and writes them to a 'system'
> > table, "_stats_".
> >
> > With the framework you could easily write your own custom stats, from
> > simple things like min/max keys to things like fixed width or fixed depth
> > histograms, or even more complicated. There has been some internal
> > discussion around how to make this available to the community (as part of
> > Phoenix, core in HBase, an independent github project, ...?).
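
As a rough illustration of what a pluggable statistic of the "min/max keys" kind might look like, here is a minimal Java sketch. It is not the framework's actual code, which is not shown in this thread: the `Statistic` interface, the `MinMaxKey` class, and their method names are all hypothetical. The only assumption carried over from HBase is that row keys are ordered as unsigned bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical plug-in point: one implementation per kind of statistic.
interface Statistic {
    // Called once for every row key that survives the compaction.
    void update(byte[] rowKey);

    // The value that would be written to the stats table for this store.
    String summary();
}

// Tracks the smallest and largest row key seen, in unsigned byte order.
class MinMaxKey implements Statistic {
    private byte[] min;
    private byte[] max;

    @Override
    public void update(byte[] rowKey) {
        if (min == null || Arrays.compareUnsigned(rowKey, min) < 0) {
            min = rowKey.clone();
        }
        if (max == null || Arrays.compareUnsigned(rowKey, max) > 0) {
            max = rowKey.clone();
        }
    }

    @Override
    public String summary() {
        if (min == null) {
            return "empty";
        }
        return "min=" + new String(min, StandardCharsets.UTF_8)
             + " max=" + new String(max, StandardCharsets.UTF_8);
    }
}

class MinMaxKeyDemo {
    public static void main(String[] args) {
        Statistic stat = new MinMaxKey();
        for (String key : new String[] {"row-b", "row-a", "row-c"}) {
            stat.update(key.getBytes(StandardCharsets.UTF_8));
        }
        System.out.println(stat.summary());
    }
}
```

A histogram statistic would implement the same two methods, so the compaction hook only ever needs to know about the interface.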
> >
> > The biggest issue around having it all CP based is that you need to be
> > really careful to ensure that it comes _after_ all the other compaction
> > coprocessors. This way you know exactly what keys come out and have correct
> > statistics (for that point in time). Not a huge issue - you just need to be
> > careful. Baking the stats framework into HBase is really nice in that we
> > can be sure we never mess this up.
> >
> > Building it into the core of HBase isn't going to get us per-region
> > statistics without a whole bunch of pain - compactions per store make this
> > a pain to actualize; there isn't a real advantage here, as I'd like to keep
> > it per CF, if only not to change all the things.
> >
> > Further, this would be a great first use-case for real system tables.
> > Mixing this data with .META. is going to be a bit of a mess, especially for
> > doing clean scans, etc. to read the stats. Also, I'd be gravely concerned
> > to muck with such important state, especially if we make a 'statistic' a
> > pluggable element (so people can easily expand their own).
> >
> > And sure, we could make it make pretty graphs on the UI, no harm in it and
> > very little overhead :)
> >
> > -------------------
> > Jesse Yates
> > @jesse_yates
> > jyates.github.com
> >
> >
> > On Tue, Feb 26, 2013 at 2:08 PM, Stack <[EMAIL PROTECTED]> wrote:
> >
> > > On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> > >
> > > > This topic comes up now and then (see recent discussion about
> > > > translating multi Gets into Scan+Filter).
> > > >
> > > > It's not that hard to keep statistics as part of compactions.
> > > > I envision two knobs:
> > > > 1. Max number of distinct values to track directly. If a column has
> > > > fewer than this # of values, keep track of their occurrences explicitly.
> > > > 2. Number of (equal width) histogram partitions to maintain.
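
One way to picture how the two knobs could interact is the Java sketch below. It is only an illustration under stated assumptions, not code from HBase or from this thread: the class `ColumnValueStats` and its methods are hypothetical, column values are assumed to be numeric (`long`) with a value range `[lo, hi)` known in advance, and exact tracking is dropped once more than `maxDistinct` distinct values have been seen.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-column accumulator combining both knobs.
class ColumnValueStats {
    private final int maxDistinct;   // knob 1: cap on directly tracked values
    private final long[] buckets;    // knob 2: one counter per histogram partition
    private final long lo;           // assumed lower bound of values (inclusive)
    private final long hi;           // assumed upper bound of values (exclusive)
    private Map<Long, Long> exactCounts = new HashMap<>();

    ColumnValueStats(int maxDistinct, int partitions, long lo, long hi) {
        if (partitions <= 0 || hi <= lo) {
            throw new IllegalArgumentException("need partitions > 0 and lo < hi");
        }
        this.maxDistinct = maxDistinct;
        this.buckets = new long[partitions];
        this.lo = lo;
        this.hi = hi;
    }

    void add(long value) {
        // Equal-width histogram: every partition spans (hi - lo) / partitions.
        int idx = (int) ((value - lo) * buckets.length / (hi - lo));
        idx = Math.max(0, Math.min(buckets.length - 1, idx)); // clamp outliers
        buckets[idx]++;

        // Exact occurrence counts, kept only while the column has few values.
        if (exactCounts != null) {
            exactCounts.merge(value, 1L, Long::sum);
            if (exactCounts.size() > maxDistinct) {
                exactCounts = null; // too many distinct values: histogram only
            }
        }
    }

    boolean tracksExactCounts() {
        return exactCounts != null;
    }

    // Returns -1 once exact tracking has been abandoned.
    long exactCount(long value) {
        return exactCounts == null ? -1 : exactCounts.getOrDefault(value, 0L);
    }

    long bucketCount(int partition) {
        return buckets[partition];
    }
}

class ColumnValueStatsDemo {
    public static void main(String[] args) {
        ColumnValueStats stats = new ColumnValueStats(2, 4, 0, 100);
        for (long v : new long[] {5, 5, 30, 99}) {
            stats.add(v);
        }
        System.out.println("exact tracking: " + stats.tracksExactCounts());
        for (int i = 0; i < 4; i++) {
            System.out.println("bucket " + i + ": " + stats.bucketCount(i));
        }
    }
}
```

Low-cardinality columns get exact counts, while everything else falls back to the coarser histogram, so the memory used per column stays bounded by the two knobs.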
> > > >
> > > > Statistics would be kept per store (i.e. per region per column family) and
> > > > stored into an HBase table (one row per store). Initially we could just
> > > > support major compactions that atomically insert a new version of