HBase, mail # dev - Simple statistics per region


Re: Simple statistics per region
Andrew Purtell 2013-02-23, 17:18
I like this idea. For the kind of decisions I would like to make based on
CF statistics (characterizing access patterns), histograms or
approximations would be fine.

+1 on a core HBase facility.

Otherwise, let's consider how statistics kept by such a CP would be shared
with other CPs or clients.
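
For the client side of that question, here is a minimal sketch of what
reading such statistics could look like, assuming the layout Lars describes
below (one row per store in a dedicated statistics table) and using the
plain 0.94-era client API. The table name, row-key format and column names
are made up for illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class StoreStatsClient {
  // Hypothetical statistics table and layout: one row per store,
  // keyed by <table>,<region>,<column family>.
  private static final byte[] STATS_TABLE = Bytes.toBytes("_stats_");
  private static final byte[] STATS_CF = Bytes.toBytes("s");
  private static final byte[] DISTINCT_COUNT = Bytes.toBytes("distinct");

  /** Returns the distinct-value estimate for one store, or -1 if absent. */
  public static long distinctValueEstimate(Configuration conf,
      String userTable, String encodedRegion, String family) throws Exception {
    HTable stats = new HTable(conf, STATS_TABLE);
    try {
      byte[] rowKey = Bytes.toBytes(userTable + "," + encodedRegion + "," + family);
      Result r = stats.get(new Get(rowKey));
      byte[] value = r.getValue(STATS_CF, DISTINCT_COUNT);
      return value == null ? -1 : Bytes.toLong(value);
    } finally {
      stats.close();
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    System.out.println(distinctValueEstimate(conf, "usertable", "region-0001", "cf"));
  }
}

A coprocessor that wants the same numbers could issue the same Get from its
own hooks, since the statistics table would be an ordinary table; the open
question is whether we want a richer interface than raw Gets.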

The coprocessor framework is (deliberately) not a module system. There are
no provisions for exporting functions from one coprocessor, importing them
into another, or managing the dependencies between them. If we were to add
that, it might really be more appropriate to rebase CPs on an OSGi runtime
rather than incrementally reinvent that wheel and end up with something
close anyway, only with more bugs. We have avoided anything like this so
far, of course. We had someone from Apache Felix come by and suggest it
once. To see what's involved, go here: http://felix.apache.org/

On Fri, Feb 22, 2013 at 10:40 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> This topic comes up now and then (see recent discussion about translating
> multi Gets into Scan+Filter).
>
> It's not that hard to keep statistics as part of compactions.
> I envision two knobs:
> 1. Max number of distinct values to track directly. If a column has fewer
> than this # of values, keep track of their occurrences explicitly.
> 2. Number of (equal width) histogram partitions to maintain.
>
> Statistics would be kept per store (i.e. per region per column family) and
> stored in an HBase table (one row per store). Initially we could just
> support major compactions that atomically insert a new version of the
> statistics for the store.
>
> A simple implementation (not knowing ahead of time how many values it
> will see during the compaction) could start by keeping track of individual
> values for columns. If it gets past the max # of distinct values to track,
> it would switch to equal-width histograms (using the distinct values picked
> up so far to estimate an initial partition width).
> If the number of partitions gets larger than what was configured, it would
> increase the width and merge the previous counts into the new partitions
> (which means the new partition width must be a multiple of the previous
> one). There's probably a lot of other fanciness that could be used here (I
> haven't spent a lot of time thinking about the details).
>
>
> Is this something that should be in core HBase, or rather be implemented
> as a coprocessor?
>
>
> -- Lars
>
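
For concreteness, a rough sketch of the adaptive tracking described above,
in plain Java and outside of any compaction hook. The class name and both
knob names are made up, and it assumes numeric (long) column values to keep
the sketch short:

import java.util.HashMap;
import java.util.Map;

/**
 * Illustrative only: tracks column values exactly until maxDistinct is
 * exceeded, then degrades to an equal-width histogram whose partition
 * width doubles (so the new width is always a multiple of the old one)
 * whenever the number of partitions would exceed maxPartitions.
 */
public class AdaptiveValueStats {
  private final int maxDistinct;   // knob 1: max distinct values tracked exactly
  private final int maxPartitions; // knob 2: max equal-width histogram partitions

  private Map<Long, Long> exact = new HashMap<Long, Long>(); // value -> count
  private Map<Long, Long> buckets;  // partition index -> count
  private long origin;              // left edge of partition 0
  private long width;               // current partition width

  public AdaptiveValueStats(int maxDistinct, int maxPartitions) {
    this.maxDistinct = maxDistinct;
    this.maxPartitions = maxPartitions;
  }

  /** Called once per cell value seen during the compaction. */
  public void add(long value) {
    if (exact != null) {
      Long c = exact.get(value);
      exact.put(value, c == null ? 1L : c + 1L);
      if (exact.size() > maxDistinct) {
        switchToHistogram();
      }
      return;
    }
    addToBucket(value, 1L);
  }

  /** Estimate an initial width from the distinct values seen so far. */
  private void switchToHistogram() {
    long min = Long.MAX_VALUE, max = Long.MIN_VALUE;
    for (long v : exact.keySet()) {
      min = Math.min(min, v);
      max = Math.max(max, v);
    }
    origin = min;
    width = Math.max(1L, (max - min) / maxPartitions + 1);
    buckets = new HashMap<Long, Long>();
    for (Map.Entry<Long, Long> e : exact.entrySet()) {
      addToBucket(e.getKey(), e.getValue());
    }
    exact = null; // exact tracking is abandoned from here on
  }

  private void addToBucket(long value, long count) {
    long idx = Math.floorDiv(value - origin, width);
    Long c = buckets.get(idx);
    buckets.put(idx, c == null ? count : c + count);
    // Too many partitions? Double the width (a multiple of the previous
    // size, so old partition boundaries still line up) and merge counts.
    while (buckets.size() > maxPartitions) {
      width *= 2;
      Map<Long, Long> merged = new HashMap<Long, Long>();
      for (Map.Entry<Long, Long> e : buckets.entrySet()) {
        long newIdx = Math.floorDiv(e.getKey(), 2);
        Long prev = merged.get(newIdx);
        merged.put(newIdx, prev == null ? e.getValue() : prev + e.getValue());
      }
      buckets = merged;
    }
  }
}

In the real implementation this would presumably be fed from the compaction
scanner and written out as one row per store into the statistics table when
the major compaction completes.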

--
Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein
(via Tom White)