HBase >> mail # dev >> Re: A general question on maxVersion handling when we have Secondary index tables


Re: A general question on maxVersion handling when we have Secondary index tables
@Ted: Are you proposing re-opening the "should we have secondary indexes in
HBase" discussion? If so, I'm +1 on adding them. Wanna file a JIRA?

@Wei Tan: Yeah, I generally agree. However, I think you can get away with
ignoring MVCC and just keep an index on the latest key (where key
_includes_ the timestamp) and then do lazy cleanup.

@Ram: if you move the TS into the CQ (column qualifier) you can drop the
actual cell TS. It costs some minor computational overhead to pull the TS
back out, but you still get the right answer without actually using HBase
timestamps.
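
As a rough illustration of the qualifier idea, here is a minimal sketch of
the byte layout only, not HBase client code. The names encode_qualifier and
decode_qualifier and the sample values are hypothetical; the 64-bit suffix
and the fixed cell timestamp of 10 come from the quoted message below.

```python
import struct
from typing import Tuple

TS_LEN = 8          # the timestamp occupies the last 64 bits of the qualifier
FIXED_CELL_TS = 10  # constant cell timestamp, so maxVersions = 1 keeps one cell

def encode_qualifier(qualifier: bytes, ts_millis: int) -> bytes:
    """Append the timestamp as an unsigned 64-bit big-endian suffix."""
    return qualifier + struct.pack(">Q", ts_millis)

def decode_qualifier(encoded: bytes) -> Tuple[bytes, int]:
    """Split an encoded qualifier back into (qualifier, timestamp)."""
    if len(encoded) < TS_LEN:
        raise ValueError("encoded qualifier is shorter than the timestamp suffix")
    (ts_millis,) = struct.unpack(">Q", encoded[-TS_LEN:])
    return encoded[:-TS_LEN], ts_millis

# Hypothetical example values.
encoded = encode_qualifier(b"city", 1346140800000)
qualifier, ts = decode_qualifier(encoded)
```

Big-endian is assumed here so that byte-wise ordering of the suffix matches
numeric ordering of non-negative timestamps.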

I've proposed that you can just do an async cleanup of the index when you
find out it's stale, with minimal overhead to the clients. Otherwise, yes,
you would need a way to tie together the versions in the index and primary
tables, which you don't always want to keep exactly the same.
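
To make the lazy cleanup concrete, here is a toy in-memory sketch. It
assumes a read path that checks each index hit against the primary table and
queues mismatches for later removal. The dict, set and queue stand in for
the primary table, the index table and a background worker; all names are
hypothetical and nothing here is HBase API.

```python
from collections import deque

primary = {}           # row key -> (value, ts): latest version per row
index = set()          # (value, ts, row key): the index key includes the timestamp
stale_queue = deque()  # index entries discovered to be stale during reads

def put(row, value, ts):
    """Write the row and add a new index entry; old entries are left in place."""
    primary[row] = (value, ts)
    index.add((value, ts, row))

def lookup(value):
    """Return rows whose current value matches; queue stale hits for cleanup."""
    rows = []
    for entry in sorted(index):
        indexed_value, ts, row = entry
        if indexed_value != value:
            continue
        if primary.get(row) == (indexed_value, ts):
            rows.append(row)
        else:
            stale_queue.append(entry)  # stale entry: clean up later, off the read path
    return rows

def cleanup():
    """Background step: remove queued entries that are still stale."""
    removed = 0
    while stale_queue:
        entry = stale_queue.popleft()
        value, ts, row = entry
        if entry in index and primary.get(row) != (value, ts):
            index.remove(entry)
            removed += 1
    return removed

put("row1", "blue", 100)
put("row1", "red", 200)      # leaves a stale ("blue", 100, "row1") index entry
stale_hits = lookup("blue")  # the stale entry is filtered out and queued
fresh_hits = lookup("red")
removed = cleanup()
```

The client only pays for the extra check against the primary table; the
deletes happen later, which is the "minimal overhead" trade-off described
above.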

Also, there is an issue when returning the version of the row based on the
indexed TS. Should you return the whole row? Should you return just the
parts of the row with timestamps the same age or older? For the latter, how
do you know which parts of the row to return when you have two versions of
the same column that was indexed (which other row elements should be
included based on TS)? I'd propose these are all questions that need to be
answered if we are going to do a general HBase index.
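
For illustration only, one possible reading of the "same age or older"
option: for each column, return the newest value whose timestamp is not
newer than the indexed TS. The function name row_as_of and the sample data
are hypothetical, and this sketch does not settle the ambiguity raised
above.

```python
def row_as_of(row_versions, indexed_ts):
    """For each column, pick the newest value with timestamp <= indexed_ts."""
    result = {}
    for column, versions in row_versions.items():
        eligible = [(ts, value) for ts, value in versions if ts <= indexed_ts]
        if eligible:
            # Newest eligible version wins; columns with none are omitted.
            result[column] = max(eligible, key=lambda tv: tv[0])[1]
    return result

# Hypothetical row: column -> list of (timestamp, value) versions.
row_versions = {
    "name": [(100, "alice"), (300, "alicia")],
    "city": [(150, "NYC"), (250, "SF")],
    "email": [(400, "a@example.com")],
}
snapshot = row_as_of(row_versions, 250)
```

Under this reading, columns written only after the indexed TS drop out of
the result, which may or may not be what a caller expects.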
-------------------
Jesse Yates
@jesse_yates
jyates.github.com
On Tue, Aug 28, 2012 at 9:03 AM, Ted Yu <[EMAIL PROTECTED]> wrote:

> I think this discussion should be on HBASE JIRA.
>
> Another dimension to secondary indexing is the co-location (or pairing) of
> data table region and index table region. Related regions from the two
> tables should be placed on the same region server.
>
> Cheers
>
> On Tue, Aug 28, 2012 at 8:52 AM, Wei Tan <[EMAIL PROTECTED]> wrote:
>
> > Thanks for sharing a pointer to your implementation.
> > My two cents:
> > Timestamp is a way to do MVCC, and setting every KV with the same TS will
> > make concurrency control very tricky and error prone, if not impossible.
> > I think Ram is talking about the dead entry in the index table rather than
> > the data table. Deleting old index entries upfront when there is a new put
> > might be a choice.
> >
> >
> > Best Regards,
> > Wei
> >
> > Wei Tan
> > Research Staff Member
> > IBM T. J. Watson Research Center
> > 19 Skyline Dr, Hawthorne, NY  10532
> > [EMAIL PROTECTED]; 914-784-6752
> >
> >
> >
> > From:   Jesse Yates <[EMAIL PROTECTED]>
> > To:     [EMAIL PROTECTED],
> > Date:   08/28/2012 04:00 AM
> > Subject:        Re: A general question on maxVersion handling when we have
> > Secondary index tables
> >
> >
> >
> > Ram,
> >
> > If I understand correctly, I think you can design your index such that you
> > don't actually use the timestamp (e.g. everything gets put with a TS = 10 -
> > or some other non-special, relatively small number that's not 0 as I'd
> > worry about that in HBase ;) Then when you set maxVersions to 1, everything
> > should be good.
> >
> > You get a couple of wasted bytes from the TS, but with the prefixTrie stuff
> > that should be pretty minimal overhead. If you do need to keep track of the
> > timestamp you should be able to munge that back up into the column
> > qualifier (and just know that the last 64 bits are the timestamp). Again a
> > little more CPU cost, but it's really not that big of an overhead. It seems
> > like you don't really care about the TS though, in which case this should
> > be pretty simple.
> >
> > Out of curiosity, what are people using for their secondary indexing
> > solutions? I know there are a bunch out there, but don't know what people
> > have adopted, what they like/dislike, design tradeoffs made and why.
> >
> > Disclaimer: I recently proposed a secondary indexing solution myself
> > (shameless self-plug:
> > http://jyates.github.com/2012/07/09/consistent-enough-secondary-indexes.html
> > )
> > and it's something I'm working on for Salesforce - open sourced at some
> > point, promise!