HBase >> mail # dev >> A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
Ted,

Ram summarizes the concern succinctly. To answer the specific question:
it isn't about versions -- it is about the case where a secondary index
entry can point to many, many primary rows.  (Let's say we have a rowkey
userid and we want to have a 2ndary index based on the state portion of
their address -- each index entry will end up pointing to many, many
primary rows.)
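
To make the fan-out concrete, here is a minimal in-memory sketch in plain
Python. It does not use any HBase API. The table contents, the helper
names, and the fixed-size "regions" are all made up for illustration; real
region boundaries depend on splits. It only shows that one low-selectivity
index key (a state) maps to primary rows spread across many regions.

```python
from collections import defaultdict

# Hypothetical primary table: rowkey is a userid, value is the state
# portion of that user's address. All data here is invented.
PRIMARY_ROWS = {
    "user001": "CA",
    "user002": "NY",
    "user003": "CA",
    "user004": "TX",
    "user005": "CA",
    "user006": "NY",
}


def build_state_index(primary_rows):
    """Map each state to the sorted list of primary rowkeys that have it."""
    index = defaultdict(list)
    for userid, state in sorted(primary_rows.items()):
        index[state].append(userid)
    return dict(index)


def assign_regions(primary_rows, rows_per_region):
    """Cut the sorted rowkeys into contiguous ranges, as a stand-in for regions."""
    keys = sorted(primary_rows)
    return {key: i // rows_per_region for i, key in enumerate(keys)}


def regions_touched(index, region_of, state):
    """Return the region ids a lookup on `state` must visit in the primary table."""
    return sorted({region_of[userid] for userid in index.get(state, [])})


STATE_INDEX = build_state_index(PRIMARY_ROWS)
REGION_OF = assign_regions(PRIMARY_ROWS, rows_per_region=2)

if __name__ == "__main__":
    for state in sorted(STATE_INDEX):
        print(state, STATE_INDEX[state], regions_touched(STATE_INDEX, REGION_OF, state))
```

In this toy data a lookup on a common state has to visit every region of
the primary table, while a rare one touches a single region.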

Jon.

On Wed, Aug 29, 2012 at 7:15 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Thanks for the detailed response, Jon.
>
> bq. it would mean that a query based on secondary index would
> potentially have to hit every region server that has a region in the
> primary table.
>
> Can you elaborate on the above a little bit?
> Is this because the secondary index would point us to more than one region
> in the data table because several versions are saved for the same row?
>
> My thinking was to ease management of simultaneous (data and index) region
> split through region colocation.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:47 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>
> > I'm more of a fan of having secondary indexes added as an external
> > feature (coproc or new client library on top of our current client
> > library) and focusing on only adding apis necessary to make 2ndary
> > indexes possible and correct on/in HBase.  There are many different use
> > patterns and requirements and one style of secondary index will not be
> > good for everything.  Do we only care about this working well for highly
> > selective keys?  What are possible indexes (col name, value, value
> > prefix, everything our filters support?)  Do we care more about writes
> > or reads, ACID correctness or speed, etc?  Also, there are several
> > questions about how we handle other features in conjunction with 2ndary
> > indexes: replication, bulk load, snapshots, to name a few.
> >
> > Maybe it makes sense to spend some time defining what we want to index
> > secondarily and what a user api to this external api would be.  Then we
> > could have the different implementations under-the-covers, and allow for
> > users to swap implementations for the tradeoffs that fit their use cases.
> >  It wouldn't be free to change but hopefully "easy" from a user point of
> > view.
> >
> > Personally, I tend to favor more of a Percolator-style implementation --
> > it is a client library and built on top of HBase. This approach seems to
> > be more "HBase-style" with its emphasis on consistency and atomicity,
> > and seems to require only a few modifications to HBase core. Sure, it is
> > likely slower than my read of Jesse's proposal, but it seems to be
> > always consistent and thus predictable in cases where there are failures
> > on deletes and updates. We'd need HBase API primitives like a
> > checkAndMutate call (check with multiple delete/put on the same row),
> > and possibly an atomic multitable bulkload.  I'm not sure that it is
> > replication compatible, and there are probably questions we'll need to
> > answer once snapshots solidify.
> >
> > Ted's idea of colocating regions (like the index table's
> > regions) definitely feels like a primitive (pluggable, likely-per-table
> > region assignment plans) that we could add to HBase core. This
> > requirement though for 2ndary indexes seems to imply an approach similar
> > to Cassandra's approach -- having a local index of each region on its
> > region server and colocating them.  Is this right?  If so, this is
> > essentially a filtering optimization -- it would mean that a query based
> > on secondary index would potentially have to hit every region server
> > that has a region in the primary table.  This is a great approach if the
> > index lookup has high cardinality, but if the secondary index is highly
> > selective, you'd have to march through a bunch of RS's before getting an
> > answer.
> >
> > Jon.
> >
> > On Tue, Aug 28, 2012 at 9:18 PM, Ramkrishna.S.Vasudevan <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi
> > >
> > > Yes I was talking about the dead entry in the index table rather than

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]