Re: HBase Developer's Pow-wow.
Can indexing be boiled down to these questions to start?

1) Per-region or Per-table
2) Sync or Async
3) Client-managed or Server-managed
4) Schema or Schema-less


- Per-region: the index entries are stored on the same machine as the
primary rows
- Per-table: each index is stored in a separate table, requiring
cross-server consistency

- Sync: the client blocks until all index entries exist
- Async: the client returns when the primary row has been inserted, but
indexes are guaranteed to be created eventually

- Client-managed: client pushes index entries directly to regions, possibly
utilizing some server-side locks or id generators
- Server-managed: client pushes index entries to the same server as the
primary row, letting the server push the index entries on to the
destination regions

- Schema: (complex to even define) client and/or server have info about
column names, value formats, etc.  (Taking this route opens a world of
follow-on questions)
- Schema-less: client provides the index entries which are rows with opaque
row/family/qualifier/timestamp like in normal hbase

Personal opinions:

All of my use-cases would require Per-table indexes.  Per-region is easier
to keep consistent at write-time, but it seems useless to me for the large
tables that hbase is designed for (because you have to hit every region for
each read).
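
To make the read-cost point concrete, here is a rough in-memory sketch of the two layouts. Every type in it (Region, IndexTable, the regionsTouched* helpers) is made up for illustration and is not an HBase API, and it assumes that all entries for one value fall inside a single index region.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory model; none of these types are HBase APIs.
final class IndexFanOutSketch {

    /** Per-region layout: each region indexes only its own primary rows. */
    static final class Region {
        final TreeMap<String, List<String>> localIndex = new TreeMap<>();

        void put(String row, String indexedValue) {
            localIndex.computeIfAbsent(indexedValue, v -> new ArrayList<>()).add(row);
        }
    }

    /** Per-table layout: the index is its own sorted table, split by start key. */
    static final class IndexTable {
        final TreeMap<String, TreeMap<String, List<String>>> regionsByStartKey = new TreeMap<>();

        IndexTable(String... splitKeys) {
            regionsByStartKey.put("", new TreeMap<>()); // first region starts at the empty key
            for (String k : splitKeys) {
                regionsByStartKey.put(k, new TreeMap<>());
            }
        }

        TreeMap<String, List<String>> regionFor(String indexedValue) {
            return regionsByStartKey.floorEntry(indexedValue).getValue();
        }

        void put(String row, String indexedValue) {
            regionFor(indexedValue).computeIfAbsent(indexedValue, v -> new ArrayList<>()).add(row);
        }
    }

    /** A per-region read cannot tell which region holds the value, so it asks all of them. */
    static int regionsTouchedPerRegion(List<Region> regions, String value, List<String> out) {
        int touched = 0;
        for (Region r : regions) {
            touched++;
            out.addAll(r.localIndex.getOrDefault(value, List.of()));
        }
        return touched;
    }

    /** A per-table read routes by the index key and asks only the owning index region. */
    static int regionsTouchedPerTable(IndexTable table, String value, List<String> out) {
        out.addAll(table.regionFor(value).getOrDefault(value, List.of()));
        return 1;
    }
}
```

In this model the per-region read cost grows with the number of regions in the primary table, while the per-table read stays at one index region per lookup, at the price of a cross-server write.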

I think synchronous writes are important for high-consistency (OLTP style)
use cases while async is important for high-throughput (OLAP style).  I'd
say sync is the more desirable feature because it's easier to roll your own
async.  I would love to see the difference reduced to a per-index-entry
flag on the Put object.

Client-managed vs Server-managed isn't tremendously important.
Client-managed seems admirable for the sync case, but server-managed is
better for async.  Therefore, it's probably better to keep the API simple and
do server-managed for both cases with a flag for sync/async.
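
As a sketch only, the server-managed path with a sync/async flag per index entry could look roughly like the toy model below. The Put, IndexEntry and Server classes here are invented for illustration and are not the real HBase classes; the single in-memory index map stands in for the remote index regions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.TreeMap;

// Hypothetical model of a server-managed write path; not an HBase API.
final class ServerManagedIndexSketch {

    static final class IndexEntry {
        final String indexRow;
        final boolean sync; // the per-index-entry flag

        IndexEntry(String indexRow, boolean sync) {
            this.indexRow = indexRow;
            this.sync = sync;
        }
    }

    /** A put that carries its own index entries alongside the primary row. */
    static final class Put {
        final String row;
        final String value;
        final List<IndexEntry> indexEntries = new ArrayList<>();

        Put(String row, String value) {
            this.row = row;
            this.value = value;
        }

        Put addIndexEntry(String indexRow, boolean sync) {
            indexEntries.add(new IndexEntry(indexRow, sync));
            return this;
        }
    }

    static final class Server {
        final Map<String, String> primary = new TreeMap<>();
        final Map<String, String> index = new TreeMap<>();        // index row -> primary row
        private final Queue<String[]> pending = new ArrayDeque<>(); // deferred async entries

        /** Returns only after the primary row and all sync entries are written. */
        void put(Put p) {
            primary.put(p.row, p.value);
            for (IndexEntry e : p.indexEntries) {
                if (e.sync) {
                    index.put(e.indexRow, p.row);
                } else {
                    pending.add(new String[] {e.indexRow, p.row});
                }
            }
        }

        /** Stands in for the background work that makes async entries eventually visible. */
        void drainAsync() {
            String[] item;
            while ((item = pending.poll()) != null) {
                index.put(item[0], item[1]);
            }
        }
    }
}
```

The client API is the same in both modes; only the flag changes when the entry becomes visible. A real implementation would also need the async queue to survive a server crash, which this model ignores.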

The notion of adding a schema to hbase for secondary indexing scares me a
little.  Many of us already have ORM-type layers above hbase that do all
sorts of custom serializations.  It would be more flexible to let the
client generate arbitrary index entries and ship them to the server inside
the Put object.
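
For example, a client with its own serialization could build opaque index row keys along these lines. This is only a sketch: the encoding (a sign-flipped big-endian int followed by the primary row key) is one assumed convention picked for illustration, and the class and method names are hypothetical. The server would treat the result as plain bytes.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical client-side helper; the server never needs to understand this encoding.
final class OpaqueIndexKeySketch {

    /** Encodes an int so that unsigned byte order matches numeric order. */
    static byte[] encodeSortableInt(int v) {
        // Flipping the sign bit moves negative numbers below positive ones.
        return ByteBuffer.allocate(4).putInt(v ^ 0x80000000).array();
    }

    /** Index row = fixed-width encoded value + primary row key, so equal values stay adjacent. */
    static byte[] indexRow(byte[] encodedValue, String primaryRow) {
        byte[] pk = primaryRow.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[encodedValue.length + pk.length];
        System.arraycopy(encodedValue, 0, out, 0, encodedValue.length);
        System.arraycopy(pk, 0, out, encodedValue.length, pk.length);
        return out;
    }
}
```

Because the value part is fixed-width, no separator byte is needed, and a scan over the index in byte order returns entries sorted by value and then by primary row.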

Anyway - my abbreviated 2 cents on a big topic.

On Mon, Sep 10, 2012 at 11:09 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> On Mon, Sep 10, 2012 at 12:03 AM, Jacques <[EMAIL PROTECTED]> wrote:
> >    - How important is indexing column qualifiers themselves (similar to
> >    Cassandra where people frequently utilize column qualifiers as "values"
> >    with no actual values stored)?
> It would be good to have a secondary indexing option that can build an
> index from some transform of family+qualifier.
> >    - In general it seems like there is tension between the main low level
> >    approaches of (1) leverage as much HBase infrastructure as possible
> >    (e.g. secondary tables) and (2) leverage an efficient indexing library
> >    e.g. Lucene.
> Regarding option #2, Jason Rutherglen's experiences may be of
> interest: https://issues.apache.org/jira/browse/HBASE-3529 . The new
> Codec and CodecProvider classes of Lucene 4 could conceivably support
> storage of postings in HBase proper now
> (http://wiki.apache.org/lucene-java/FlexibleIndexing) so HDFS hacks
> for bringing indexes local for mmapping may not be necessary, though
> this is a huge hand-wave.
> The remainder of your mail is focused on option #1, I have no comment
> to add there, lots of food for thought.
> > *Approach Thoughts*
> > Trying to leverage HBase as much as possible is hard if we want to utilize
> > the approach above and have consistent indexing.  However, I think we can
> > do it if we add support for what I will call a "local shadow family".
> > These are additional, internally managed families for a table.  However,
> > they have the special characteristic that they belong to the region