HBase >> mail # dev >> A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
Ted,

Ram summarizes the concern succinctly. To answer the specific question:
it isn't about versions; it is about the case where a secondary index can
point to many, many primary rows.  (Let's say we have a rowkey of userid and
we want to have a 2ndary index based on the state portion of their address:
each index entry will end up pointing to many, many primary rows.)
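
To make the state example concrete, here is a minimal sketch in plain
Python. It is an in-memory model, not HBase client code: the table contents,
the region boundary, and the names build_index, region_of and
regions_touched are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical primary table: rowkey is the userid, one column holds the state.
PRIMARY = {
    "user1": {"state": "CA"},
    "user2": {"state": "CA"},
    "user3": {"state": "NY"},
    "user4": {"state": "CA"},
}


def region_of(rowkey):
    """Assumed region layout: rowkeys below "user3" are in region-a, the rest in region-b."""
    return "region-a" if rowkey < "user3" else "region-b"


def build_index(table, column):
    """Map each indexed value to the sorted list of primary rowkeys holding it."""
    index = defaultdict(list)
    for rowkey in sorted(table):
        index[table[rowkey][column]].append(rowkey)
    return dict(index)


def regions_touched(index, value):
    """Primary regions a lookup on `value` has to visit to fetch the rows."""
    return {region_of(rowkey) for rowkey in index.get(value, [])}


STATE_INDEX = build_index(PRIMARY, "state")
print(STATE_INDEX["CA"])                           # ['user1', 'user2', 'user4']
print(sorted(regions_touched(STATE_INDEX, "CA")))  # ['region-a', 'region-b']
```

In this toy layout a single index value (CA) fans out to three primary rows
spread over both regions, with only one version per row involved.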

Jon.

On Wed, Aug 29, 2012 at 7:15 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> Thanks for the detailed response, Jon.
>
> bq. it would mean that a query based on secondary index would
> potentially have to hit every region server that has a region in the
> primary table.
>
> Can you elaborate on the above a little bit?
> Is this because the secondary index would point us to more than one region
> in the data table because several versions are saved for the same row?
>
> My thinking was to ease management of simultaneous (data and index) region
> split through region colocation.
>
> Cheers
>
> On Wed, Aug 29, 2012 at 6:47 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>
> > I'm more of a fan of having secondary indexes added as an external
> > feature (coproc or new client library on top of our current client
> > library) and focusing on only adding the APIs necessary to make 2ndary
> > indexes possible and correct on/in HBase.  There are many different use
> > patterns and requirements, and one style of secondary index will not be
> > good for everything.  Do we only care about this working well for highly
> > selective keys?  What are the possible indexes (col name, value, value
> > prefix, everything our filters support)?  Do we care more about writes or
> > reads, ACID correctness or speed, etc.?  Also, there are several
> > questions about how we handle other features in conjunction with 2ndary
> > indexes: replication, bulk load, snapshots, to name a few.
> >
> > Maybe it makes sense to spend some time defining what we want to index
> > secondarily and what a user API to this external API would be.  Then we
> > could have the different implementations under the covers, and allow
> > users to swap implementations for the tradeoffs that fit their use cases.
> > It wouldn't be free to change but hopefully "easy" from a user point of
> > view.
> >
> > Personally, I tend to favor more of a Percolator-style implementation --
> > it is a client library built on top of HBase.  This approach seems to be
> > more "HBase-style" with its emphasis on consistency and atomicity, and
> > seems to require only a few modifications to HBase core.  Sure, it is
> > likely slower than my read of Jesse's proposal, but it seems to be always
> > consistent and thus predictable in cases where there are failures on
> > deletes and updates.  We'd need HBase API primitives like a
> > checkAndMutate call (check with multiple deletes/puts on the same row),
> > and possibly an atomic multi-table bulk load.  I'm not sure that it is
> > replication compatible, and there are probably questions we'll need to
> > answer once snapshots solidify.
> >
> > Ted's idea of colocating regions (like the index table's regions)
> > definitely feels like a primitive (pluggable, likely per-table region
> > assignment plans) that we could add to HBase core.  This requirement for
> > 2ndary indexes, though, seems to imply an approach similar to Cassandra's
> > -- having a local index of each region on the region server and
> > colocating them.  Is this right?  If so, this is essentially a filtering
> > optimization -- it would mean that a query based on the secondary index
> > would potentially have to hit every region server that has a region in
> > the primary table.  This is a great approach if the index lookup has high
> > cardinality, but if the secondary index is highly selective, you'd have
> > to march through a bunch of RS's before getting an answer.
> >
> > Jon.
> >
> > On Tue, Aug 28, 2012 at 9:18 PM, Ramkrishna.S.Vasudevan <
> > [EMAIL PROTECTED]> wrote:
> >
> > > Hi
> > >
> > > Yes I was talking about the dead entry in the index table rather than

// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]