HBase >> mail # dev >> Re: [ANNOUNCE] Secondary Index in HBase - from Huawei


Re: [ANNOUNCE] Secondary Index in HBase - from Huawei
Vladimir,

I wasn't talking about anything outside of HBase.

The point I was trying to make was that if you are going to use an inverted table as your index, managing your index at the RS level is going to bite you in the ass and will cause more headaches down the road.

This is being done because they want to avoid the overhead of RPC calls. But you're in a distributed database where RPC is part of the ecosystem, and it's something that you have to deal with. (And you can do some basic design to decouple the write to the index from the write to the base table.)
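
To illustrate what decoupling the index write from the base table write could look like, here is a minimal in-memory sketch in Python. It is not HBase code: the dicts, the queue and the names put and index_writer are all hypothetical stand-ins. The base row is written first, and the index update is queued and applied by a separate worker, so the index can briefly lag behind the base table.

```python
import queue
import threading

# Hypothetical in-memory stand-ins, not HBase APIs.
base_table = {}    # row key -> column values (the base table)
index_table = {}   # (column, value) -> set of row keys (the inverted table)
pending = queue.Queue()  # index updates waiting to be applied


def put(row_key, columns, indexed=("A",)):
    """Write the base row, then queue the index updates instead of applying them inline."""
    base_table[row_key] = dict(columns)
    for col in indexed:
        if col in columns:
            pending.put((col, columns[col], row_key))


def index_writer():
    """Drain queued updates and apply them to the index table."""
    while True:
        item = pending.get()
        if item is None:  # sentinel: stop the worker
            pending.task_done()
            break
        col, value, row_key = item
        index_table.setdefault((col, value), set()).add(row_key)
        pending.task_done()


worker = threading.Thread(target=index_writer, daemon=True)
worker.start()

put("row1", {"A": "x", "B": "y"})
put("row2", {"A": "x"})

pending.join()     # wait until the index has caught up with the base table
pending.put(None)  # ask the worker to stop
worker.join()
```

The trade-off in this sketch is that a reader can see a base row before its index entry exists, so anything built this way has to tolerate a short window of staleness.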

In addition to this, the use of an inverted table is just one of the options you have for a secondary index. You could also look at Lucene, which we did a PoC with a few years back.

Also beyond the secondary indexing, you have issues with coprocessors in general that should be addressed.
But that's a different story.

Please don't misunderstand: secondary indexing is a very important thing, but tying the index to the region is going down the wrong path.

When you look at trying to integrate it into Phoenix, you'll start to see the problems….

Hint:

SELECT * FROM tbl_foo WHERE tbl_foo.A = Something AND tbl_foo.B = SomethingElse

This is still pretty straightforward, since you can take the sort-ordered intersection by RS.
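
As a rough sketch of the single-table case (Python, with made-up row keys and region server names), each RS can intersect its own sorted index hits for A and B, and the per-RS results can simply be concatenated:

```python
def intersect_sorted(left, right):
    """Merge-style intersection of two ascending lists of row keys."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            out.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return out


# Hypothetical index hits per region server, already in row-key order.
hits_a = {"rs1": ["r01", "r03", "r07"], "rs2": ["r12", "r15"]}  # A = Something
hits_b = {"rs1": ["r03", "r07", "r09"], "rs2": ["r11", "r15"]}  # B = SomethingElse

matches = []
for rs in sorted(hits_a):
    # Both indexes cover the same rows on a given RS, so a local intersection is enough.
    matches.extend(intersect_sorted(hits_a[rs], hits_b.get(rs, [])))
```

This works locally only because both predicates are on the same table, so the index entries for a given row live on the same RS as the row.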

But then if you have the following:

SELECT *
FROM   tbl_foo, tbl_bar
WHERE  tbl_foo.A = tbl_bar.A
AND    tbl_foo.C = Something
AND    tbl_bar.X = Something_Else

And you have indexes on A, C and X

That's actually 4 indexes: tbl_foo.A, tbl_foo.C, tbl_bar.A and tbl_bar.X.

And here's the rub. To do the join, you need to find the intersection of the complete index sets, not just the intersection on each node.

You need each of the indexes in sort order.
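
A small Python sketch of why the per-node view falls short for the join. The data, the region server names and the merge_join helper are invented for illustration, and the filters on C and X are left out to keep it short. Each RS holds a sorted fragment of the index on A for each table. Joining fragment by fragment misses matches whose rows live on different servers, so the fragments first have to be merged into one globally sorted stream per index.

```python
import heapq


def merge_join(left, right):
    """Sort-merge join of two sequences of (key, row) pairs, each ascending by key."""
    left, right = list(left), list(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of equal keys on each side and emit every pairing.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lrow in left[i:i_end]:
                for _, rrow in right[j:j_end]:
                    out.append((lk, lrow, rrow))
            i, j = i_end, j_end
    return out


# Hypothetical per-RS fragments of the A index: (A value, row key), sorted within each RS.
foo_a = {"rs1": [(1, "f1"), (4, "f2")], "rs2": [(2, "f7"), (4, "f9")]}
bar_a = {"rs1": [(2, "b3")], "rs2": [(1, "b5"), (4, "b8")]}

# Joining node by node only sees pairs that happen to share a region server.
local_only = [m for rs in foo_a for m in merge_join(foo_a[rs], bar_a[rs])]

# Merging the fragments into globally sorted streams finds every match.
foo_sorted = heapq.merge(*foo_a.values())
bar_sorted = heapq.merge(*bar_a.values())
joined = merge_join(foo_sorted, bar_sorted)
```

With this toy data the node-by-node join finds one pair, while the join over the merged streams finds four. The extra merge step is where the read-side cost comes from.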

I'm not saying that you can't use the proposed solution, but that you will take a performance hit on the reads.

-Just saying…
On Aug 14, 2013, at 11:40 AM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:

> Michael, I do not think it's a competitor to Solr, Solr/HBase or Cloudera
> Search, but it can be a good addition to an HBase SQL front-end, such as
> Phoenix.
>
>
> On Wed, Aug 14, 2013 at 8:45 AM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> Guys,
>>
>> Sorry to be a Debbie Downer here, but really this is not a good idea.
>> Here's why:
>>
>> In terms of design, you have some serious scalability and performance
>> issues when compared to alternatives.
>>
>>
>> Let me try to give you a real life example. *
>>
>> CCCIS (CCC Information Services) is the middle man in the US between the
>> auto repair shop and the insurance company. They have one competitor but
>> they handle most of the accident claims in the US.
>> So when you go to your authorized repair shop, they have this application
>> called Pathways which takes down all of your information and the accident,
>> the parts required to be replaced and sends it first to CCC which then
>> sends it on to your insurance company. In short CCC collects a lot of
>> information about the type of vehicles, the accidents, the cost of parts,
>> labor to put your car back on the road.  As the middle man they collect a
>> lot of very useful information…
>>
>> So imagine you have a large data warehouse in HBase of all of the claims.
>> Your primary key is going to be a composite of the insurer and the claim_id.
>>
>> But you're going to want to also index based on the make/model, type of
>> accident, driver details, location… , VIN
>>
>> This will allow your actuaries to figure out the average cost of a front
>> end collision, by make and model, by state/zip.
>> Or by age bracket, who's a better driver?
>>
>> Imagine that the claim table will have a column for the claim in its
>> entirety  as an Avro doc (JSON) along with the important fields broken out
>> separately.  (For this example the schema isn't that important.)
>>
>> So you want to find the average cost of a front end collision of a VOLVO
>> S80 for the past 3 model years.