HBase >> mail # user >> feature request (count)


Re: feature request (count)
Storing numPuts, numDeletes, and maxVersions for each block in the block index
could be useful.  If a block is all puts, no deletes, and maxVersions=1,
then you can be more confident of the count.  If the block indexes indicate
that no other blocks overlap, then the count could be correct without ever
hitting the disk.

Those metrics could be useful for speeding up compactions as well.  Maybe
you could avoid decompressing and recompressing the data block.
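A rough sketch of that block-index idea, in Java.  All names here (BlockMeta, numPuts, numDeletes, maxVersions, firstRow, lastRow) are hypothetical, not actual HFile fields; the point is just the decision rule: exact count is safe only when every block is pure single-version puts and no row ranges overlap.

```java
// Hypothetical sketch -- not real HBase/HFile code.  Models per-block stats
// in the block index and the check for when an exact count needs no disk read.
public class BlockIndexSketch {

    // Illustrative per-block metadata (names are assumptions, not HFile fields).
    static class BlockMeta {
        final long numPuts;
        final long numDeletes;
        final int maxVersions;
        final String firstRow;
        final String lastRow;

        BlockMeta(long numPuts, long numDeletes, int maxVersions,
                  String firstRow, String lastRow) {
            this.numPuts = numPuts;
            this.numDeletes = numDeletes;
            this.maxVersions = maxVersions;
            this.firstRow = firstRow;
            this.lastRow = lastRow;
        }
    }

    // Returns the exact put count summed from the index, or -1 when the
    // metadata alone cannot guarantee correctness (deletes, multiple
    // versions, or overlapping row ranges force a real read).
    // Blocks are assumed sorted by firstRow.  Note this counts puts (KVs),
    // which equals the row count only when each row has a single cell.
    static long exactCountOrMinusOne(BlockMeta... blocks) {
        long total = 0;
        String prevLast = null;
        for (BlockMeta b : blocks) {
            if (b.numDeletes > 0 || b.maxVersions != 1) return -1;
            if (prevLast != null && b.firstRow.compareTo(prevLast) <= 0) return -1;
            total += b.numPuts;
            prevLast = b.lastRow;
        }
        return total;
    }
}
```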
On Fri, Jun 3, 2011 at 3:56 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:

> Stats are a good idea, having a fuzzy count is sometimes good enough.
> Getting exact counts without actually reading the data will be very
> difficult.  Perhaps there will be future clever ideas that make this
> easier?
>
> On Fri, Jun 3, 2011 at 3:50 PM, Bill Graham <[EMAIL PROTECTED]> wrote:
> > One alternative option is to calculate some stats during compactions and
> > store that somewhere for retrieval. The metrics wouldn't be up to date of
> > course, since they'd be stats from the last compaction. I think that
> > would still be useful info to have, but it's different than what's being
> > requested.
> >
> >
> > On Fri, Jun 3, 2011 at 3:40 PM, Jack Levin <[EMAIL PROTECTED]> wrote:
> >
> >> "Each HFile knows how many KV entries there are in it, but this does
> >> not map in a general way to the
> >> number of rows, or the number of rows with a specific column."
> >>
> >> It would be nice to have an index like that; it would solve a lot of
> >> issues for people migrating from MySQL.  I assume that without the
> >> 'count' feature, people are resorting to storing dataset elements in
> >> other engines, which is not great, since you then end up requiring a
> >> non-HBase index to be consistent and authoritative for all of your
> >> datasets that require counts.
> >>
> >> -Jack
> >>
> >>
> >> On Fri, Jun 3, 2011 at 3:24 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
> >> > This is a commonly requested feature, and it remains unimplemented
> >> > because it is actually quite hard.  Each HFile knows how many KV
> >> > entries there are in it, but this does not map in a general way to the
> >> > number of rows, or the number of rows with a specific column. Keeping
> >> > track of the row count as new rows are created is also not as easy as
> >> > it seems - this is because a Put does not know if a row already exists
> >> > or not.  Making it aware of that fact would require doing a get before
> >> > a put - not cheap.
> >> >
> >> > -ryan
> >> >
> >> > On Fri, Jun 3, 2011 at 3:20 PM, Jack Levin <[EMAIL PROTECTED]> wrote:
> >> >> I have a feature request:  There should be a native function called
> >> >> 'count' that produces a count of rows based on a specific family
> >> >> filter, is internal to HBase, and does not need to read cells off the
> >> >> disk/cache.  Just count up the rows in the most efficient way
> >> >> possible.  I realize that family definitions are part of the cells, so
> >> >> it would be nice to have an index that somehow produces a low IO/CPU
> >> >> hit on HBase when doing a count (for example, enabling an index like
> >> >> that in the table schema would be how you turn it on for a specific
> >> >> family).
> >> >>
> >> >> Best,
> >> >>
> >> >> -Jack
> >> >>
> >> >
> >>
> >
>