HBase >> mail # dev >> Re: [ANNOUNCE] Secondary Index in HBase - from Huawei


Re: [ANNOUNCE] Secondary Index in HBase - from Huawei
Guys,

Sorry to be a Debbie Downer here, but this really is not a good idea. Here's why:

In terms of design, you have some serious scalability and performance issues compared to the alternatives.
Let me try to give you a real-life example. *

CCCIS (CCC Information Services) is the middleman in the US between the auto repair shop and the insurance company. They have one competitor, but they handle most of the accident claims in the US.
So when you go to your authorized repair shop, they have this application called Pathways. It takes down all of your information, the details of the accident, and the parts that need to be replaced, and sends it first to CCC, which then sends it on to your insurance company. In short, as the middleman CCC collects a lot of very useful information: the types of vehicles, the accidents, and the cost of the parts and labor needed to put your car back on the road.

So imagine you have a large data warehouse in HBase of all of the claims. Your primary key is going to be a composite of the insurer and the claim_id.  

But you're also going to want to index on the make/model, type of accident, driver details, location, VIN…

This will allow your actuaries to figure out the average cost of a front end collision, by make and model, by state/zip.
Or, by age bracket, which drivers are better.

Imagine that the claim table has a column holding the claim in its entirety as an Avro doc (JSON), along with the important fields broken out separately. (For this example the schema isn't that important.)

So you want to find the average cost of a front end collision of a VOLVO S80 for the past 3 model years.

Now, you have an index based on manufacturer/model/year.
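To make the index lookup concrete, here is a minimal sketch in plain Python (no HBase involved). It assumes the index key is make, model, and model year followed by the base row key, joined with a NUL separator, and that index entries are kept in sorted order the way HBase keeps row keys sorted. The names `index_key`, `scan_range`, and `base_keys_for` are made up for illustration.

```python
import bisect

SEP = "\x00"  # assumed separator; sorts below every printable character


def index_key(make, model, year, base_row_key):
    """Build an index entry: make/model/year first, base row key last."""
    return SEP.join([make, model, f"{year:04d}", base_row_key])


def scan_range(sorted_keys, start, stop):
    """Return all keys k with start <= k < stop, like a start/stop row scan."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, stop)
    return sorted_keys[lo:hi]


def base_keys_for(sorted_keys, make, model, first_year, last_year):
    """Base row keys for one make/model over an inclusive model-year range."""
    start = SEP.join([make, model, f"{first_year:04d}"])
    stop = SEP.join([make, model, f"{last_year + 1:04d}"])
    # The base row key is everything after the third separator.
    return [k.split(SEP, 3)[3] for k in scan_range(sorted_keys, start, stop)]
```

With keys laid out this way, the three model years of the VOLVO S80 are one contiguous range of the index, so the lookup is a single range scan rather than a filter over every entry.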

Using your index scheme, you now have to query every RS (region server) for the row keys in the index.
Then you have to take these results and merge them into sorted order before you can use them.

Note: This isn't too bad if you're doing a simple query against one index. You can do the work by RS and then join the results from all RS.
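The fan-out for a single index can be sketched roughly as follows. This is plain Python standing in for the cluster: each "region server" is just a list of (index key, base row key) pairs, and `query_local_index` and `scatter_gather` are hypothetical names, not HBase calls.

```python
import heapq


def query_local_index(local_index, prefix):
    """One RS answers from its own index shard; result is sorted by row key."""
    return sorted(row_key for idx_key, row_key in local_index
                  if idx_key.startswith(prefix))


def scatter_gather(region_servers, prefix):
    """Ask every RS, then merge the per-RS sorted results into one sorted list.

    Every RS is contacted, even the ones that hold no matching entries,
    so the number of requests grows with the cluster, not with the result.
    """
    partials = [query_local_index(rs, prefix) for rs in region_servers]
    return list(heapq.merge(*partials))
```

Note that the request count in this sketch is `len(region_servers)` no matter how selective the prefix is.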

However… what happens if you have two indexes and your result set is going to be the intersection of the indexes?

Or you're going to do a join between two tables using the indexes to limit the result set?

Now your design breaks down quickly.
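As a rough illustration of the intersection step only, assume each index lookup has already been gathered into a sorted list of base row keys. The two lists then have to be walked together. `intersect_sorted` is a made-up helper in plain Python, not part of any HBase API.

```python
def intersect_sorted(a, b):
    """Row keys present in both sorted lists, via a two-pointer merge."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out
```

The walk itself is cheap. The cost is in getting both inputs fully gathered and sorted before it can start.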

And then there's another problem.
Your index may be much smaller than your base table.
In this example… the insurance claim is a huge record. I would say 2-3 orders of magnitude larger than the row key. Since you split your index at the same rate you split your table… you will have a lot of small regions for your index.
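A back-of-the-envelope version of that point, with made-up numbers: assume a 10 TB claim table, a 10 GB maximum region size, and index entries 1000 times smaller than the rows they point to. The sketch also assumes the tied index gets one region per base-table region. `region_counts` is a hypothetical helper.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def region_counts(base_bytes, region_max_bytes, size_ratio):
    """Return (base regions, index regions if tied to the base table,
    index regions if the index were an independent table)."""
    base_regions = ceil_div(base_bytes, region_max_bytes)
    index_bytes = base_bytes // size_ratio
    tied_index_regions = base_regions  # splits whenever the base table splits
    standalone_index_regions = max(1, ceil_div(index_bytes, region_max_bytes))
    return base_regions, tied_index_regions, standalone_index_regions


TB = 10**12
GB = 10**9
print(region_counts(10 * TB, 10 * GB, 1000))
```

Under these assumptions the tied index ends up with as many regions as the base table, each holding about 10 MB, while the same index as its own table would fit in a single region.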

Again, this may lead to other issues…

Is it better than doing a full table scan? Sure.

Are there better alternatives?
Yes.
Apply KISS. (Keep it simple.)

Still using an inverted table, let HBase manage it like any other table rather than trying to tie it to the underlying base table.
While it's not perfect, it's lighter, and it will perform better in the general use cases. (You could even use Async HBase to decouple the write to the base table from the update to the index.)
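Here is a minimal sketch of that shape, using plain Python dicts in place of the two HBase tables and a queue plus a worker thread in place of an asynchronous client. `ClaimStore` and its methods are hypothetical names, and the sketch ignores failure handling, so treat it as an illustration of the decoupling only.

```python
import queue
import threading


class ClaimStore:
    """Base table and inverted index kept as two independent tables."""

    def __init__(self):
        self.base = {}    # row key -> claim document
        self.index = {}   # "make/model/year/row key" -> row key
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._apply_index_updates, daemon=True)
        worker.start()

    def put_claim(self, row_key, make, model, year, claim_doc):
        # The base-table write completes first; the index update is only queued.
        self.base[row_key] = claim_doc
        self._pending.put((f"{make}/{model}/{year}/{row_key}", row_key))

    def _apply_index_updates(self):
        # Background worker: drains queued index updates one at a time.
        while True:
            index_key, row_key = self._pending.get()
            self.index[index_key] = row_key
            self._pending.task_done()

    def flush(self):
        """Block until every queued index update has been applied."""
        self._pending.join()

    def lookup(self, prefix):
        """Claims whose index key starts with prefix, in index-key order."""
        return [self.base[rk] for key, rk in sorted(self.index.items())
                if key.startswith(prefix)]
```

The trade-off is that the index can lag the base table for a moment, so a reader that needs the latest write has to wait for the queue to drain (here, `flush()`).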

The same model could be applied to a Lucene index as well.

Just Saying….

-Mike
*FULL DISCLOSURE
I am a consultant and CCC was a client of mine back in the late '90s.  In one project I worked on ProEFT (now defunct) and an ODS, also now defunct.  The example is a hypothetical of what I would do if I were CCC and wanted to use Big Data to help manage Auto claims. Any resemblance to any actual work being done by CCC in the Big Data space is pure coincidence. ;-)

On Aug 13, 2013, at 1:31 PM, Andrew Purtell <[EMAIL PROTECTED]> wrote:

> Thanks so much for the contribution!
>
> On Mon, Aug 12, 2013 at 11:19 PM, rajeshbabu chintaguntla <
> [EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> We have been working on implementing secondary index in HBase, and had
>> shared an overview of our design in the 2012  Hadoop Technical Conference
>> at Beijing(http://bit.ly/hbtc12-hindex). We are pleased to open source it