HBase, mail # dev - A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
Jonathan Hsieh 2012-08-29, 17:46
Let me rephrase to make sure I'm on the same page about Ram's question:

We do three inserts on row 1 at different times to the same column (which
is being indexed in a secondary table)  (Are we assuming only a 1-to-1
secondary->primary mapping?)

t1 < t2 < t3
put ("row1", "cf:c", "val1", t1)
put ("row1", "cf:c", "val2", t2)
put ("row1", "cf:c", "val3", t3)

What happens is in the primary table we have:

row1 / cf:c = val1 @ t1
row1 / cf:c = val2 @ t2
row1 / cf:c = val3 @ t3

I'm assuming that these writes happen to a secondary table like this:
put ("val1", "r", "row1", t1)
put ("val2", "r", "row1", t2)
put ("val3", "r", "row1", t3)

and in the secondary table we have:

val1 / r = row1 @ t1
val2 / r = row1 @ t2
val3 / r = row1 @ t3

The core question is how and when we can efficiently and correctly get rid
of the now-invalid val1 and val2 rows in the index table.
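
To make the stale entries concrete, here is a rough sketch of the state above
using a toy in-memory model. The Table class and stale_index_rows helper are
made up for illustration and are not the HBase client API; the sketch also
assumes the 1-to-1 secondary->primary mapping and uses 1, 2, 3 for t1, t2, t3.

```python
from collections import defaultdict


class Table:
    """Toy versioned table: (row, column) -> {timestamp: value}. Not the HBase API."""

    def __init__(self):
        self.cells = defaultdict(dict)

    def put(self, row, col, value, ts):
        self.cells[(row, col)][ts] = value

    def get(self, row, col):
        """Return the newest version of the cell, or None if it does not exist."""
        versions = self.cells.get((row, col))
        if not versions:
            return None
        return versions[max(versions)]

    def rows(self):
        return sorted({row for (row, _col), versions in self.cells.items() if versions})


t1, t2, t3 = 1, 2, 3
primary, index = Table(), Table()
for value, ts in [("val1", t1), ("val2", t2), ("val3", t3)]:
    primary.put("row1", "cf:c", value, ts)   # put ("row1", "cf:c", value, ts)
    index.put(value, "r", "row1", ts)        # put (value, "r", "row1", ts)


def stale_index_rows(primary, index):
    """Index rows whose target no longer holds that value as its newest version."""
    stale = []
    for value in index.rows():
        target_row = index.get(value, "r")
        if primary.get(target_row, "cf:c") != value:
            stale.append(value)
    return stale


print(stale_index_rows(primary, index))  # ['val1', 'val2']
```

Only val3 still agrees with the newest version in the primary table; val1 and
val2 are the rows we need to get rid of.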

Let's look at some of the strawmen:
1) periodically scan the secondary table and add delete markers for invalid
entries (removed on major compaction).
2) lazily add delete markers on reads that hit invalid entries (we are @t4,
attempt to read via "val2" in the 2ndary index, see that the primary value no
longer matches, so do a checkAndDelete of val2 from the 2ndary).  These would
also get removed on major compaction.
3) delete on update.  This means we need to know whether we are modifying a
value, and thus incurs at least one extra read per write.
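
A rough sketch of the three strawmen on a toy in-memory model, just to compare
where the extra work lands. Everything here is hypothetical (plain dicts, made-up
function names, no HBase calls), it assumes a 1-to-1 mapping, and deletes take
effect immediately instead of being delete markers that wait for a major
compaction.

```python
primary = {}  # primary row -> current indexed value (newest version only)
index = {}    # indexed value -> primary row


def put_plain(row, value):
    """Write with no index maintenance; older index entries are left behind."""
    primary[row] = value
    index[value] = row


def sweep():
    """Strawman 1: periodic scan of the index that drops invalid entries."""
    for value, row in list(index.items()):
        if primary.get(row) != value:
            del index[value]


def lookup(value):
    """Strawman 2: validate against the primary on read, clean up lazily."""
    row = index.get(value)
    if row is None:
        return None
    if primary.get(row) != value:
        # Stand-in for a checkAndDelete: only delete if the entry is unchanged.
        if index.get(value) == row:
            del index[value]
        return None
    return row


def put_with_cleanup(row, value):
    """Strawman 3: delete on update, at the cost of one extra read per write."""
    old = primary.get(row)  # the extra read
    if old is not None and old != value:
        index.pop(old, None)
    primary[row] = value
    index[value] = row


for v in ("val1", "val2", "val3"):
    put_plain("row1", v)
print(sorted(index))    # ['val1', 'val2', 'val3']
print(lookup("val2"))   # None, and val2 is removed from the index
sweep()
print(sorted(index))    # ['val3']
```

With put_with_cleanup in place of put_plain the index never holds more than
val3, but every write pays for the read of the old value.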

Ram, does this seem like the right question and potential options to
consider?

Jon.

On Wed, Aug 29, 2012 at 8:12 AM, Ramkrishna.S.Vasudevan <
[EMAIL PROTECTED]> wrote:

> When we have a many-to-one mapping between the main and secondary index
> tables, we may end up hitting many RS. If there is a one-to-one mapping,
> that may not be a problem.
>
> Basically, my intention with this discussion was mainly to discuss version
> maintenance for any type of secondary index, particularly removing the
> stale data in the index table that would have expired.
>
> Regards
> Ram
>
>
> > -----Original Message-----
> > From: Ted Yu [mailto:[EMAIL PROTECTED]]
> > Sent: Wednesday, August 29, 2012 7:45 PM
> > To: [EMAIL PROTECTED]
> > Subject: Re: A general question on maxVersion handling when we have
> > Secondary index tables
> >
> > Thanks for the detailed response, Jon.
> >
> > bq. it would mean that a query based on secondary index would
> > potentially have to hit every region server that has a region in the
> > primary table.
> >
> > Can you elaborate on the above a little bit ?
> > Is this because secondary index would point us to more than one region in
> > the data table because several versions are saved for the same row ?
> >
> > My thinking was to ease management of simultaneous (data and index) region
> > split through region colocation.
> >
> > Cheers
> >
> > On Wed, Aug 29, 2012 at 6:47 AM, Jonathan Hsieh <[EMAIL PROTECTED]>
> > wrote:
> >
> > > I'm more of a fan of having secondary indexes added as an external feature
> > > (coproc or new client library on top of our current client library) and
> > > focusing on only adding apis necessary to make 2ndary indexes possible and
> > > correct on/in HBase.  There are many different use patterns and
> > > requirements and one style of secondary index will not be good for
> > > everything.  Do we only care about this working well for highly selectivity
> > > keys?  What are possible indexes (col name, value, value prefix, everything
> > > our filters support?)  Do we care more about writes or reads, ACID
> > > correctness or speed, etc?  Also, there are several questions about how we
> > > handle other features in conjunction with 2ndary indexes: replication, bulk
> > > load, snapshots, to name a few.
> > >
> > > Maybe it makes sense to spend some time defining what we want to index
> > > secondarily and what a user api to this external api would be.  Then we
> > > could have the different implementations under-the-covers, and allow for
> > > users to swap implementations for the tradeoffs that fit their use
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]