HBase, mail # dev - A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
Jonathan Hsieh 2012-08-29, 13:47
I'm more of a fan of having secondary indexes added as an external feature
(coproc or new client library on top of our current client library) and
focusing on only adding apis necessary to make 2ndary indexes possible and
correct on/in HBase.  There are many different use patterns and
requirements and one style of secondary index will not be good for
everything.  Do we only care about this working well for highly selective
keys?  What are the possible indexes (col name, value, value prefix,
everything our filters support)?  Do we care more about writes or reads, ACID
correctness or speed, etc?  Also, there are several questions about how we
handle other features in conjunction with 2ndary indexes: replication, bulk
load, snapshots, to name a few.

Maybe it makes sense to spend some time defining what we want to index
secondarily and what a user API to this external feature would be.  Then we
could have different implementations under the covers, and allow
users to swap implementations for the tradeoffs that fit their use cases.
It wouldn't be free to switch, but hopefully it would be "easy" from a user
point of view.

Personally, I tend to favor more of a Percolator-style implementation --
it is a client library built on top of HBase. This approach seems to be
more "HBase-style" with its emphasis on consistency and atomicity, and seems
to require only a few modifications to HBase core. Sure, it is likely slower
than my read of Jesse's proposal, but it seems to be always consistent and
thus predictable in cases where there are failures on deletes and updates.
We'd need HBase API primitives like a checkAndMutate call (a check with
multiple deletes/puts on the same row), and possibly an atomic multi-table
bulk load.  I'm not sure that it is replication compatible, and there are
probably questions we'll need to answer once snapshots solidify.
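
A minimal sketch of the semantics asked for above: one check followed by
several puts/deletes, applied to a single row as one atomic step. This is a
toy in-memory model, not the HBase client API; the class, method, and column
names (CheckAndMutateSketch, Row, Mutation, "city", "pending_index") are
hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

/** Toy in-memory model of a row-level check-and-mutate. Hypothetical names; not HBase code. */
public class CheckAndMutateSketch {

    /** A single-column change: a put when value is non-null, a delete when value is null. */
    static final class Mutation {
        final String column;
        final String value;

        private Mutation(String column, String value) {
            this.column = column;
            this.value = value;
        }

        static Mutation put(String column, String value) {
            return new Mutation(column, Objects.requireNonNull(value));
        }

        static Mutation delete(String column) {
            return new Mutation(column, null);
        }
    }

    /** One row. The monitor lock on the row stands in for a per-row lock on the server. */
    static final class Row {
        private final Map<String, String> cells = new TreeMap<>();

        /**
         * Applies all mutations only if checkColumn currently holds expected
         * (expected == null means "the column must be absent").
         * Returns true if the mutations were applied, false if nothing changed.
         */
        synchronized boolean checkAndMutate(String checkColumn, String expected,
                                            List<Mutation> mutations) {
            if (!Objects.equals(cells.get(checkColumn), expected)) {
                return false; // check failed: the row is left untouched
            }
            for (Mutation m : mutations) {
                if (m.value == null) {
                    cells.remove(m.column);
                } else {
                    cells.put(m.column, m.value);
                }
            }
            return true;
        }

        synchronized String get(String column) {
            return cells.get(column);
        }
    }

    public static void main(String[] args) {
        Row row = new Row();

        // First write: succeeds only because "city" is absent.
        boolean created = row.checkAndMutate("city", null, Arrays.asList(
                Mutation.put("city", "Oslo"),
                Mutation.put("pending_index", "Oslo")));

        // Stale writer: expects a value that is not there, so nothing is applied.
        boolean stale = row.checkAndMutate("city", "Paris", Arrays.asList(
                Mutation.put("city", "Rome")));

        // Update: one put and one delete, applied together or not at all.
        boolean updated = row.checkAndMutate("city", "Oslo", Arrays.asList(
                Mutation.put("city", "Bergen"),
                Mutation.delete("pending_index")));

        System.out.println(created + " " + stale + " " + updated + " city=" + row.get("city"));
    }
}
```

The point of the sketch is only that the check and every put/delete happen
under the same row lock, so a concurrent writer never sees a half-applied
update on that row. Cross-row and cross-table atomicity is out of scope here.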

Ted's idea of colocating regions (like the index table's
regions) definitely feels like a primitive (pluggable, likely per-table
region assignment plans) that we could add to HBase core. This requirement
for 2ndary indexes, though, seems to imply an approach similar to Cassandra's
-- having a local index for each region on the same region server and
colocating them.  Is this right?  If so, this is essentially a filtering
optimization -- it would mean that a query based on a secondary index would
potentially have to hit every region server that has a region in the
primary table.  This is a great approach if the index lookup has high
cardinality, but if the secondary index is highly selective, you'd have to
march through a bunch of RSs before getting an answer.
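
To make the fan-out concern concrete, here is a small sketch that counts how
many index lookups a query needs under each layout. It is a toy model under
stated assumptions, not HBase or Cassandra code: Region, Result,
queryLocalIndexes, and queryGlobalIndex are hypothetical names, and the
"global" case counts only the lookup in the index itself, not the follow-up
reads of the primary rows.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy comparison of per-region (local) indexes and one global index. Hypothetical names. */
public class IndexFanoutSketch {

    /** A primary-table region carrying its own local index: indexed value -> row keys. */
    static final class Region {
        final Map<String, List<String>> localIndex = new HashMap<>();

        void put(String rowKey, String indexedValue) {
            localIndex.computeIfAbsent(indexedValue, v -> new ArrayList<>()).add(rowKey);
        }
    }

    /** Matching row keys plus the number of index lookups that were needed. */
    static final class Result {
        final List<String> rowKeys;
        final int indexLookups;

        Result(List<String> rowKeys, int indexLookups) {
            this.rowKeys = rowKeys;
            this.indexLookups = indexLookups;
        }
    }

    /** Local indexes: any region may hold a match, so every region must be asked. */
    static Result queryLocalIndexes(List<Region> regions, String value) {
        List<String> hits = new ArrayList<>();
        int lookups = 0;
        for (Region region : regions) {
            lookups++;
            hits.addAll(region.localIndex.getOrDefault(value, Collections.<String>emptyList()));
        }
        return new Result(hits, lookups);
    }

    /** Global index: the indexed value is the key, so one lookup finds all row keys. */
    static Result queryGlobalIndex(Map<String, List<String>> globalIndex, String value) {
        List<String> hits = new ArrayList<>(
                globalIndex.getOrDefault(value, Collections.<String>emptyList()));
        return new Result(hits, 1);
    }

    public static void main(String[] args) {
        List<Region> regions = new ArrayList<>();
        Map<String, List<String>> globalIndex = new HashMap<>();
        for (int i = 0; i < 4; i++) {
            regions.add(new Region());
        }

        // A highly selective value: exactly one row in the whole table matches.
        regions.get(2).put("row-0042", "alice@example.com");
        globalIndex.computeIfAbsent("alice@example.com", v -> new ArrayList<>()).add("row-0042");

        Result local = queryLocalIndexes(regions, "alice@example.com");
        Result global = queryGlobalIndex(globalIndex, "alice@example.com");
        System.out.println("local:  " + local.rowKeys + " after " + local.indexLookups + " lookups");
        System.out.println("global: " + global.rowKeys + " after " + global.indexLookups + " lookups");
    }
}
```

In this model the local layout always costs one lookup per region no matter
how few rows match, while the global layout pays one index lookup and then
has to keep a separate index table consistent with the primary table on
every write.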

Jon.

On Tue, Aug 28, 2012 at 9:18 PM, Ramkrishna.S.Vasudevan <
[EMAIL PROTECTED]> wrote:

> Hi
>
> Yes I was talking about the dead entry in the index table rather than the
> actual data table.
>
> Regards
> Ram
>
> > -----Original Message-----
> > From: Wei Tan [mailto:[EMAIL PROTECTED]]
> > Sent: Tuesday, August 28, 2012 9:22 PM
> > To: [EMAIL PROTECTED]
> > Cc: Sandeep Tata
> > Subject: Re: A general question on maxVersion handling when we have
> > Secondary index tables
> >
> > Thanks for sharing a pointer to your implementation.
> > My two cents:
> > Timestamp is a way to do MVCC, and setting every KV with the same TS
> > will make concurrency control very tricky and error prone, if not
> > impossible.
> > I think Ram is talking about the dead entry in the index table rather
> > than the data table. Deleting old index entries upfront when there is a
> > new put might be a choice.
> >
> >
> > Best Regards,
> > Wei
> >
> > Wei Tan
> > Research Staff Member
> > IBM T. J. Watson Research Center
> > 19 Skyline Dr, Hawthorne, NY  10532
> > [EMAIL PROTECTED]; 914-784-6752
> >
> >
> >
> > From:   Jesse Yates <[EMAIL PROTECTED]>
> > To:     [EMAIL PROTECTED],
> > Date:   08/28/2012 04:00 AM
> > Subject:        Re: A general question on maxVersion handling when we
> > have
> > Secondary index tables
> >
> >
> >
> > Ram,
> >
> > If I understand correctly, I think you can design your index such that
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]