You’ll have to excuse Andy.
He’s a bit slow. HBASE-13044 should have been done 2 years ago. And it was trivial. It just got done last month…
But I digress… Long story short…
HBASE-9203 was brain dead from inception. Huawei’s idea was to index on the region, which had two problems:
1) Complexity, in that they wanted to keep the index on the same region server.
2) Joins become impossible. Well, actually not impossible, but incredibly slow when compared to the alternative.
You really should go back to the email chain.
Their defense (including Salesforce, who was going to push this approach) fell apart when you asked the simple question: how do you handle joins?
That’s their OOPS moment. Once you understand that, and allow the index to be orthogonal to the base table, things start to come together.
In short, you have a query either against a single table or across a join. You then fetch the indexes and, assuming that you’re only using the AND predicate, it’s a simple intersection of the index result sets. (Since the result sets are ordered, it’s relatively trivial to walk through and find the intersection of N lists in a single pass.)
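To make the single-pass intersection concrete, here is a rough sketch in Python. It is only an illustration, not anything taken from HBase or the JIRA: the function name `intersect_sorted` and the sample row keys are made up, and it assumes each index lookup hands back base table row keys already sorted and free of duplicates.

```python
from typing import List, Sequence


def intersect_sorted(index_results: Sequence[Sequence[bytes]]) -> List[bytes]:
    """Return the row keys that appear in every index result set.

    Assumption (not from the original post): each result set is sorted
    ascending and contains no duplicate row keys.
    """
    if not index_results:
        return []

    # One cursor per index result set; each cursor only ever moves forward,
    # so every list is walked at most once.
    positions = [0] * len(index_results)
    matches: List[bytes] = []

    # Stop as soon as any list is exhausted: nothing further can match.
    while all(pos < len(keys) for pos, keys in zip(positions, index_results)):
        current = [keys[pos] for pos, keys in zip(positions, index_results)]
        highest = max(current)

        if all(key == highest for key in current):
            # Every cursor points at the same row key: it satisfies all predicates.
            matches.append(highest)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the cursors that are behind the largest current key.
            positions = [
                pos + 1 if key < highest else pos
                for pos, key in zip(positions, current)
            ]

    return matches


# Hypothetical row keys returned by three separate index lookups (AND predicate).
by_city = [b"row03", b"row07", b"row11", b"row20"]
by_status = [b"row01", b"row07", b"row11", b"row15"]
by_type = [b"row07", b"row09", b"row11"]

candidate_rows = intersect_sorted([by_city, by_status, by_type])
```

The surviving keys in `candidate_rows` are the base table row keys you would then fetch or feed to a job. Because the index lives apart from the base table, the same walk works whether the lists come from one table or from the tables on either side of a join.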
Now you have your result set of base table row keys and you can work with that data (either returning the records to the client, or using them as input to a map/reduce job).
That’s the 30K-foot view. There’s more to it, but once Salesforce got the basic idea, they ran with it. It was really that simple concept, that the index would be orthogonal to the base table, that got them moving in the right direction.
To Joseph’s point, indexing isn’t necessarily an RDBMS feature. However, it seems that some of the Committers are suffering from rectal induced hypoxia. HBASE-12853 was created not just to help solve the issue of ‘hot spotting’ but also to get the Committers to focus on bringing the solutions that they glom on to in the client back to the server side of things.
Unfortunately, the last great attempt at fixing things on the server side was the bastardization of coprocessors, which, again, suffers from a lack of thought. This isn’t to say that allowing users to extend the server side functionality is wrong. (Because it isn’t.) But the implementation done in HBase is a tad lacking in thought.
So in terms of indexing…
Longer term picture, there have to be some fixes on the server side of things to allow one to associate an index (allowing for different types) with a base table, yet the code that actually uses the index would end up living in a client. And by client, I mean an external query engine processor that could/should sit on the cluster.
But hey! What do I know?
I gave up trying to have an intelligent/civilized conversation with Andrew because he just couldn’t grasp the basics. ;-)
The opinions expressed here are mine; while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
michael_segel (AT) hotmail.com