HBase >> mail # dev >> HBase Developer's Pow-wow.


Re: HBase Developer's Pow-wow.
On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:
>    - How important is indexing column qualifiers themselves (similar to
>    Cassandra where people frequently utilize column qualifiers as "values"
>    with no actual values stored)?

It would be good to have a secondary indexing option that can build an
index from some transform of family+qualifier.
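A minimal sketch of what such an option might do, in plain Python rather than against any HBase API. The transform (family plus lowercased qualifier), the function names, and the sample cells are all hypothetical, chosen only to show qualifiers being indexed as if they were values:

```python
from collections import defaultdict


def qualifier_index_key(family: bytes, qualifier: bytes) -> bytes:
    """Hypothetical transform: family plus the lowercased qualifier."""
    return family + b":" + qualifier.lower()


def build_index(cells, transform=qualifier_index_key):
    """Map transform(family, qualifier) to the sorted row keys that carry it.

    cells is an iterable of (row_key, family, qualifier, value) tuples;
    the stored value is ignored because the qualifier is what gets indexed.
    """
    index = defaultdict(set)
    for row_key, family, qualifier, _value in cells:
        index[transform(family, qualifier)].add(row_key)
    return {key: sorted(rows) for key, rows in index.items()}


# Sample cells with empty values, in the style of qualifiers used as values.
cells = [
    (b"user1", b"f", b"Follows_Bob", b""),
    (b"user2", b"f", b"follows_bob", b""),
    (b"user2", b"f", b"follows_amy", b""),
]
index = build_index(cells)
```

Passing a different transform changes what is indexed without touching the rest of the sketch.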

>    - In general it seems like there is tension between the main low level
>    approaches of (1) leverage as much HBase infrastructure as possible (e.g.
>    secondary tables) and (2) leverage an efficient indexing library e.g.
>    Lucene.

Regarding option #2, Jason Rutherglen's experiences may be of
interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
Codec and CodecProvider classes of Lucene 4 could conceivably support
storage of postings in HBase proper now
(http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
for bringing indexes local for mmapping may not be necessary, though
this is a huge hand-wave.

The remainder of your mail is focused on option #1. I have no comment
to add there; lots of food for thought.

> *Approach Thoughts*
> Trying to leverage HBase as much as possible is hard if we want to utilize
> the approach above and have consistent indexing.  However, I think we can
> do it if we add support for what I will call a "local shadow family".
>  These are additional, internally managed families for a table.  However,
> they have the special characteristic that they belong to the region despite
> their primary keys being outside the range of the region's.  Otherwise they
> look like a typical family.  On splits, they are regenerated (somehow).  If
> we take advantage of Lars'
> HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
> we then have the opportunity to consistently insert one or more rows into
> these local shadow families for the purpose of secondary indexing. The
> structure of these secondary families could use row keys as the indexed
> values, qualifiers for specific store files and the value of each being a
> list of originating keys (using read-append or
> HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
>  By leveraging the existing family infrastructure, we get things like
> optional in-memory indexes and basic scanners for free and don't have to
> swallow a big chunk of external indexing code.
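The layout described above (row key = indexed value, qualifier = store file, value = list of originating keys) can be pictured with a toy in-memory model. This is only an illustration under assumptions: the class name, the store file identifiers, and the append behaviour are hypothetical and are not taken from HBASE-5229 or HBASE-5993.

```python
from collections import defaultdict


class ShadowFamily:
    """Toy in-memory stand-in for the proposed local shadow family."""

    def __init__(self):
        # indexed value -> store file id -> list of originating row keys
        self.rows = defaultdict(lambda: defaultdict(list))

    def append(self, indexed_value, store_file, origin_key):
        """Append an originating key to the cell for this store file."""
        self.rows[indexed_value][store_file].append(origin_key)

    def lookup(self, indexed_value):
        """Return all originating keys for a value, across store files."""
        per_file = self.rows.get(indexed_value, {})
        return sorted(key for keys in per_file.values() for key in keys)


shadow = ShadowFamily()
shadow.append(b"blue", "sf1", b"row7")
shadow.append(b"blue", "sf2", b"row2")
shadow.append(b"red", "sf1", b"row3")
blue_rows = shadow.lookup(b"blue")
```

Keeping one cell per store file is what would let index entries be dropped together with the store file they refer to.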
>
> The simplest approach for integrating these into queries would
> internally be a ScannerBasedFilter (a filter that is based on a scanner)
> and a GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi criteria queries).  Implementation of these scanners
> could happen at one of two levels:
>
>    - StoreScanner level: A more efficient approach using the store file
>    qualifier approach above (this allows easier maintenance of index
>    deletions)
>    - RegionScanner level: A simpler implementation with less violation of
>    existing encapsulation.  We'd store row keys in qualifiers instead of
>    values to ensure ordering that works iteratively with RegionScanner.  The
>    weaknesses of this approach are less efficient scanning and figuring out
>    how to manage primary value deletes.
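The intersection and union behaviour attributed to GroupingScanner can be sketched independently of either level. The sketch below assumes each scanner yields row keys in sorted order with no duplicates; the function names are hypothetical and plain Python lists stand in for real scanners.

```python
import heapq


def union_scan(*scanners):
    """Yield each row key once, in sorted order, from sorted scanners."""
    first = True
    previous = None
    for key in heapq.merge(*scanners):
        if first or key != previous:
            yield key
        previous = key
        first = False


def intersect_scan(*scanners):
    """Yield the row keys that appear in every sorted scanner."""
    iterators = [iter(scanner) for scanner in scanners]
    if not iterators:
        return
    try:
        current = [next(it) for it in iterators]
        while True:
            highest = max(current)
            if all(key == highest for key in current):
                yield highest
                current = [next(it) for it in iterators]
            else:
                # Advance every lagging scanner up to the highest key seen.
                for i, it in enumerate(iterators):
                    while current[i] < highest:
                        current[i] = next(it)
    except StopIteration:
        # One scanner is exhausted, so no further common keys can exist.
        return


color_blue = [b"r1", b"r3", b"r5", b"r7"]
size_large = [b"r3", b"r4", b"r7"]
both = list(intersect_scan(color_blue, size_large))
either = list(union_scan(color_blue, size_large))
```

Both functions stream their inputs, so a multi-criteria query never has to materialize a full index scan in memory.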
>
> In general, the best way to deal with deletes is probably to age them out
> per storefile and just filter "near misses" as a secondary filter that
> works with ScannerBasedFilter.  The client side would be TBD but would
> probably offer some kind of criteria filters that on server side had all
> the lower level ramifications.
>
> *Future Optimizations*
> In a perfect world, we'd actually use StoreFile block start locations as
> the index pointer values in the secondary families.  This would make things
> much more compact and efficient.  Especially if we used a smarter block
> codec that took advantage of this nature.  However, this requires quite a
> bit more work since we'd need to actually use the primary keys in the
> secondary memstore and then "patch" the values to block locations as we

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)