HBase >> mail # dev >> HBase Developer's Pow-wow.


Re: HBase Developer's Pow-wow.
Jacques:
Thanks for sharing.

bq. row-level sharding as opposed to term

Please elaborate on the above a little more: what is term sharding?

bq. for what I will call a "local shadow family"

I like this idea. Users may request more than one index, but HBase is
currently not good at serving a high number of families, so we may need to
watch out.

bq. GroupingScanner (a Scanner that does intersection and/or union of
scanners for multi criteria queries)

Do you think the following enhancement is related to your proposal above?
HBASE-5416 Improve performance of scans with some kind of filters

bq. and then "patch" the values to block locations as we
flushed the primary family that we were indexing (ugh).

Yeah. We also need to consider the effect of compaction.

bq. my intuition is it would be fine as long as we didn't fix HBASE-3149

I was actually expecting someone to pick up the work of HBASE-3149 :-)

Cheers

On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:

> more food for thought on secondary indexing...
>
> *Additional questions*:
>
>    - How important is indexing column qualifiers themselves (similar to
>    Cassandra where people frequently utilize column qualifiers as "values"
>    with no actual values stored)?
>    - How important is indexing cell timestamps?
>
>
> *More thoughts/my answers on some of the questions I posed:*
>
>    - From my experience, indexes should be at the region level (i.e.,
>    row-level sharding as opposed to term sharding).  Other sharding
>    approaches will likely have scale and consistency problems.
>    - In general it seems like there is tension between the two main
>    low-level approaches: (1) leverage as much HBase infrastructure as
>    possible (e.g. secondary tables), and (2) leverage an efficient indexing
>    library, e.g. Lucene.
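> 
> To make the sharding distinction above concrete: with row-level (document)
> sharding each region indexes only its own rows, so an indexed write stays
> local but a query must fan out to every region; with term sharding the
> index is partitioned by indexed value, so a query hits a single shard but
> one row's index updates may span many shards.  Below is a hypothetical
> plain-Java model of the two schemes (all names made up, not HBase code).

```java
import java.util.*;

// Hypothetical model of the two sharding schemes (not HBase code).
public class ShardingSketch {
    // rowKey -> indexed value, split across two "regions" by row-key range.
    static List<Map<String, String>> regions = List.of(
        new TreeMap<>(Map.of("row1", "blue", "row3", "red")),
        new TreeMap<>(Map.of("row5", "blue", "row7", "green")));

    // Row-level sharding: fan the query out to every region-local index.
    static Set<String> queryRowSharded(String value) {
        Set<String> hits = new TreeSet<>();
        for (Map<String, String> region : regions)
            region.forEach((row, v) -> { if (v.equals(value)) hits.add(row); });
        return hits;
    }

    // Term sharding: one global index partitioned by indexed value, so a
    // query touches one shard but writes for a row may touch several.
    static Map<String, Set<String>> termIndex = new TreeMap<>();
    static void indexTermSharded(String row, String value) {
        termIndex.computeIfAbsent(value, k -> new TreeSet<>()).add(row);
    }

    public static void main(String[] args) {
        for (Map<String, String> region : regions)
            region.forEach(ShardingSketch::indexTermSharded);
        System.out.println(queryRowSharded("blue"));  // fan-out read
        System.out.println(termIndex.get("blue"));    // single-shard read
    }
}
```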
>
> *Approach Thoughts*
> Trying to leverage HBase as much as possible is hard if we want to utilize
> the approach above and have consistent indexing.  However, I think we can
> do it if we add support for what I will call a "local shadow family".
> These are additional, internally managed families for a table.  However,
> they have the special characteristic that they belong to the region despite
> their primary keys falling outside the region's key range.  Otherwise they
> look like a typical family.  On splits, they are regenerated (somehow).  If
> we take advantage of Lars' HBASE-5229
> (https://issues.apache.org/jira/browse/HBASE-5229), we then have the
> opportunity to consistently insert one or more rows into these local shadow
> families for the purpose of secondary indexing.  The structure of these
> secondary families could use row keys as the indexed values, qualifiers for
> specific store files, and the value of each being a list of originating keys
> (using read-append or HBASE-5993,
> https://issues.apache.org/jira/browse/HBASE-5993).
>  By leveraging the existing family infrastructure, we get things like
> optional in-memory indexes and basic scanners for free and don't have to
> swallow a big chunk of external indexing code.
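> 
> The layout just described (shadow-family row key = indexed value, qualifier
> = a specific store file, cell value = the list of originating primary row
> keys) could be modeled roughly as below.  This is a hypothetical plain-Java
> sketch of the data layout only, not HBase API; all names are made up.

```java
import java.util.*;

// Rough model of the "local shadow family" layout (hypothetical, not HBase).
public class ShadowFamilySketch {
    // indexedValue -> (storeFile -> originating primary row keys)
    static NavigableMap<String, Map<String, List<String>>> shadow = new TreeMap<>();

    // Called while flushing a store file: record that 'primaryRow' holds
    // 'indexedValue' in 'storeFile' (a read-append of the key list).
    static void index(String indexedValue, String storeFile, String primaryRow) {
        shadow.computeIfAbsent(indexedValue, k -> new TreeMap<>())
              .computeIfAbsent(storeFile, k -> new ArrayList<>())
              .add(primaryRow);
    }

    // An index lookup is just a get on the shadow row key; dropping a
    // compacted-away store file would mean deleting that qualifier's cells.
    static List<String> lookup(String indexedValue) {
        List<String> rows = new ArrayList<>();
        shadow.getOrDefault(indexedValue, Map.of()).values().forEach(rows::addAll);
        return rows;
    }

    public static void main(String[] args) {
        index("blue", "storefile-A", "row1");
        index("blue", "storefile-A", "row4");
        index("blue", "storefile-B", "row9");
        System.out.println(lookup("blue"));
    }
}
```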
>
> The simplest approach for integrating these into queries would be an
> internal ScannerBasedFilter (a filter that is based on a scanner) and a
> GroupingScanner (a Scanner that does intersection and/or union of scanners
> for multi-criteria queries).  Implementation of these scanners could happen
> at one of two levels:
>
>    - StoreScanner level: A more efficient approach using the store file
>    qualifier approach above (this allows easier maintenance of index
>    deletions)
>    - RegionScanner level: A simpler implementation with less violation of
>    existing encapsulation.  We'd store row keys in qualifiers instead of
>    values to ensure ordering that works iteratively with RegionScanner.  The
>    weaknesses of this approach are less efficient scanning and figuring out
>    how to manage primary value deletes.
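> 
> A minimal, hypothetical sketch of what the GroupingScanner could do, with
> each underlying scanner modeled as a sorted list of row keys: intersection
> as a stepped merge of cursors (multi-criteria AND) and union as a
> de-duplicating merge (OR).  Plain Java, not HBase code.

```java
import java.util.*;

// Hypothetical GroupingScanner sketch over sorted row-key streams.
public class GroupingScannerSketch {
    // Union (OR): merge the sorted key streams, dropping duplicates.
    static List<String> union(List<List<String>> scanners) {
        TreeSet<String> merged = new TreeSet<>();
        scanners.forEach(merged::addAll);  // TreeSet keeps sorted order
        return new ArrayList<>(merged);
    }

    // Intersection (AND): advance both cursors in step; emit a key only
    // when every scanner is positioned on it.
    static List<String> intersect(List<String> a, List<String> b) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int cmp = a.get(i).compareTo(b.get(j));
            if (cmp == 0) { out.add(a.get(i)); i++; j++; }
            else if (cmp < 0) i++;  // seek the lagging scanner forward
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> colorIdx = List.of("row1", "row4", "row9");
        List<String> sizeIdx  = List.of("row2", "row4", "row9");
        System.out.println(intersect(colorIdx, sizeIdx));       // AND
        System.out.println(union(List.of(colorIdx, sizeIdx)));  // OR
    }
}
```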
>
> In general, the best way to deal with deletes is probably to age them out