HBase, mail # dev - Re: HBase Developer's Pow-wow.


Re: HBase Developer's Pow-wow.
Andrew Purtell 2012-09-10, 18:09
On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:
>    - How important is indexing column qualifiers themselves (similar to
>    Cassandra where people frequently utilize column qualifiers as "values"
>    with no actual values stored)?

It would be good to have a secondary indexing option that can build an
index from some transform of family+qualifier.
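
As a rough illustration of what "some transform of family+qualifier" could look like (this is not code from the thread), the sketch below derives an index row key from a cell's family and qualifier. The function names, the lowercasing transform, and the NUL-separated key layout are all assumptions made for the example; none of them is an HBase API.

```python
from typing import Callable

SEP = b"\x00"  # assumed separator; real keys would need escaping or length prefixes


def index_key(family: bytes, qualifier: bytes, row: bytes,
              transform: Callable[[bytes, bytes], bytes]) -> bytes:
    """Build a hypothetical index row key: transformed family+qualifier,
    then the separator, then the originating row key."""
    return transform(family, qualifier) + SEP + row


def lowercase_qualifier(family: bytes, qualifier: bytes) -> bytes:
    """Example transform (an assumption): case-insensitive qualifier match."""
    return family + b":" + qualifier.lower()


key = index_key(b"f", b"Tag_Red", b"row-17", lowercase_qualifier)
```

Because the transformed value leads the key, entries with the same indexed value sort next to each other, so a prefix scan over the index could return the originating row keys.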

>    - In general it seems like there is tension between the main low level
>    approaches of (1) leverage as much HBase infrastructure as possible (e.g.
>    secondary tables) and (2) leverage an efficient indexing library e.g.
>    Lucene.

Regarding option #2, Jason Rutherglen's experiences may be of
interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
Codec and CodecProvider classes of Lucene 4 could conceivably support
storage of postings in HBase proper now
(http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
for bringing indexes local for mmapping may not be necessary, though
this is a huge hand-wave.

The remainder of your mail focuses on option #1. I have no comment
to add there; lots of food for thought.

> *Approach Thoughts*
> Trying to leverage HBase as much as possible is hard if we want to utilize
> the approach above and have consistent indexing.  However, I think we can
> do it if we add support for what I will call a "local shadow family".
>  These are additional, internally managed families for a table.  However,
> they have the special characteristic that they belong to the region despite
> their primary keys being outside the range of the region's.  Otherwise they
> look like a typical family.  On splits, they are regenerated (somehow).  If
> we take advantage of Lars'
> HBASE-5229<https://issues.apache.org/jira/browse/HBASE-5229>,
> we then have the opportunity to consistently insert one or more rows into
> these local shadow families for the purpose of secondary indexing. The
> structure of these secondary families could use row keys as the indexed
> values, qualifiers for specific store files and the value of each being a
> list of originating keys (using read-append or
> HBASE-5993<https://issues.apache.org/jira/browse/HBASE-5993>).
>  By leveraging the existing family infrastructure, we get things like
> optional in-memory indexes and basic scanners for free and don't have to
> swallow a big chunk of external indexing code.
>
> The simplest approach for integrating these into queries would
> internally be a ScannerBasedFilter (a filter that is based on a scanner)
> and a GroupingScanner (a Scanner that does intersection and/or union of
> scanners for multi-criteria queries).  Implementation of these scanners
> could happen at one of two levels:
>
>    - StoreScanner level: A more efficient approach using the store file
>    qualifier approach above (this allows easier maintenance of index
>    deletions)
>    - RegionScanner level: A simpler implementation with less violation of
>    existing encapsulation.  We'd store row keys in qualifiers instead of
>    values to ensure ordering that works iteratively with RegionScanner.  The
>    weaknesses of this approach are less efficient scanning and figuring out
>    how to manage primary value deletes.
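
To make the GroupingScanner idea concrete, here is a minimal sketch (not code from the thread) of intersecting and unioning several scanners for a multi-criteria query. It assumes each scanner is simply an iterable of row keys that is already sorted and free of duplicates; `union_scan` and `intersect_scan` are hypothetical names, not HBase classes.

```python
import heapq
from typing import Iterable, Iterator


def union_scan(*scanners: Iterable[bytes]) -> Iterator[bytes]:
    """Yield every row key seen by any scanner, in sorted order, once each."""
    last = None
    for key in heapq.merge(*scanners):
        if key != last:
            yield key
            last = key


def intersect_scan(*scanners: Iterable[bytes]) -> Iterator[bytes]:
    """Yield row keys present in every scanner.

    Assumes each scanner is sorted ascending and has no duplicates.
    """
    iters = [iter(s) for s in scanners]
    if not iters:
        return
    try:
        heads = [next(it) for it in iters]
        while True:
            target = max(heads)
            if all(h == target for h in heads):
                yield target
                heads = [next(it) for it in iters]
            else:
                # Advance every lagging scanner up to the current maximum.
                for i, it in enumerate(iters):
                    while heads[i] < target:
                        heads[i] = next(it)
    except StopIteration:
        # One scanner is exhausted, so no further common keys exist.
        return


color_hits = [b"r1", b"r3", b"r5", b"r7"]
size_hits = [b"r3", b"r4", b"r7"]
both = list(intersect_scan(color_hits, size_hits))
either = list(union_scan(color_hits, size_hits))
```

Both functions stream over their inputs, which is why the ordering of row keys matters in the RegionScanner-level variant described above: the merge only works if every underlying scanner returns keys in the same sorted order.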
>
> In general, the best way to deal with deletes is probably to age them out
> per storefile and just filter "near misses" as a secondary filter that
> works with ScannerBasedFilter.  The client side would be TBD but would
> probably offer some kind of criteria filters that, on the server side,
> carry all the lower-level ramifications.
>
> *Future Optimizations*
> In a perfect world, we'd actually use StoreFile block start locations as
> the index pointer values in the secondary families.  This would make things
> much more compact and efficient.  Especially if we used a smarter block
> codec that took advantage of this nature.  However, this requires quite a
> bit more work since we'd need to actually use the primary keys in the
> secondary memstore and then "patch" the values to block locations as we

Best regards,

   - Andy

Problems worthy of attack prove their worth by hitting back. - Piet
Hein (via Tom White)