RE: HBase - Secondary Index
If the scan from the client side does not specify a rowkey range but only a filter condition, then yes, it will go to all the primary table regions. In each region it will first scan the index table region and then seek to the exact rows in the main table region. If a region has no data at all matching the filter condition, the entire region is simply skipped.

In a normal scan too, if we can specify a rowkey range, the request goes only to the specific regions covering that range. The same holds in our secondary index case.
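To illustrate how a rowkey range limits the regions contacted, here is a minimal Python sketch. It is not HBase client code: the region start keys and the `regions_for_scan` helper are made up for illustration, and it only assumes that regions are contiguous, sorted by start key, and that an empty stop row means "scan to the end of the table".

```python
from bisect import bisect_left, bisect_right

# Hypothetical region start keys, sorted; the first region starts at the empty key.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]


def regions_for_scan(start_row=b"", stop_row=b""):
    """Return indexes of regions overlapping the range [start_row, stop_row).

    An empty stop_row is treated as "no upper bound".
    """
    # Region containing start_row: the last region whose start key <= start_row.
    first = bisect_right(REGION_START_KEYS, start_row) - 1
    if stop_row == b"":
        last = len(REGION_START_KEYS) - 1
    else:
        # Last region touched: the last one whose start key < stop_row.
        last = bisect_left(REGION_START_KEYS, stop_row) - 1
    return list(range(first, last + 1))


print(regions_for_scan())              # no range: every region is contacted
print(regions_for_scan(b"h", b"p"))    # bounded range: only overlapping regions
```

With no range the scan touches all four regions; with the range `h` to `p` it touches only the two regions that overlap it.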

In simple terms, for the scan there is no change at all in what happens at the client side: it uses the meta data to find out which regions and RSs to contact, then contacts those regions one by one and gets the data from each. The only difference is in what happens at the server side. Without the index, all the data from all the HFiles is fetched at the server side and the filter is applied to every row. Only those rows which pass the filter are returned to the client. With the index, when the scan happens at the server side, the index data is scanned first from the index region. This region is on the same RS, so there are no extra RPCs. The data to be scanned from the index table is limited, because we can create a start key and a stop key for it. Based on the result of the index scan, we know the rowkeys where the data we are interested in resides. So a reseek happens to those rows and only those rows are read. As a result, the time spent at the server side scanning a region is reduced considerably.
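The server-side difference can be sketched with a small in-memory simulation in Python. This is only an illustration, not the actual coprocessor code: the rows, the `customer` column, and the index layout (indexed value followed by main rowkey) are assumptions, and a binary search stands in for the reseek on the main region.

```python
from bisect import bisect_left

# Main-table region: rows sorted by rowkey. All values are hypothetical.
MAIN_ROWS = [
    ("row01", {"customer": "c7"}),
    ("row02", {"customer": "c3"}),
    ("row03", {"customer": "c7"}),
    ("row04", {"customer": "c1"}),
    ("row05", {"customer": "c3"}),
]
MAIN_KEYS = [key for key, _ in MAIN_ROWS]

# Index region on the same server: entries sorted by (indexed value, main rowkey).
INDEX = sorted((row["customer"], key) for key, row in MAIN_ROWS)


def scan_without_index(customer):
    """Read every row in the region and apply the filter to each one."""
    rows_read, hits = 0, []
    for key, row in MAIN_ROWS:
        rows_read += 1
        if row["customer"] == customer:
            hits.append(key)
    return hits, rows_read


def scan_with_index(customer):
    """Scan a bounded slice of the index, then seek only to the matching rows."""
    start = bisect_left(INDEX, (customer, ""))           # index start key
    stop = bisect_left(INDEX, (customer + "\x00", ""))   # index stop key (exclusive)
    rows_read, hits = 0, []
    for _, main_key in INDEX[start:stop]:
        pos = bisect_left(MAIN_KEYS, main_key)           # stands in for a reseek
        rows_read += 1
        hits.append(MAIN_ROWS[pos][0])
    return hits, rows_read


print(scan_without_index("c3"))
print(scan_with_index("c3"))
```

Both functions return the same matching rowkeys, but the indexed version reads only the rows that match instead of every row in the region.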

Yes, but there will still be calls from the client side to the RS for each region.

Now I think it might be clear to you. The ppt that I shared says the same thing; it shows what is happening at the server side.


From: anil gupta [[EMAIL PROTECTED]]
Sent: Tuesday, December 18, 2012 1:58 PM
Subject: Re: HBase - Secondary Index

Hi Anoop,

Please find my reply inline.

Anil Gupta

On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Hi Anil,
> During the scan, there is no need to fetch any index data
> to the client side. So there is no need to create any scanner on the index
> table at the client side. This happens at the server side.
> For the scan on the main table with a condition on timestamp and customer
> id, a scanner is to be created with Filters, just as when there is no
> secondary index. So this scan from the client will go through all the
> regions in the main table.
Anil: Do you mean that if the table is spread across 50 region servers in
a 60-node cluster, then we need to send a scan request to all 50 RSs?
Doesn't that sound expensive? IMHO you were not doing this in your
solution. Your solution looked cleaner than this, since you knew exactly
which node to go to for a query using the secondary index, due to the
co-location (due to the static begin part of the secondary table rowkey) of
the regions of the primary table and the secondary index table. My problem
is a little more complicated due to the constraint that I cannot have a
"static begin part" in the rowkey of my secondary table.

> When it scans one particular region, say (x,y], on the main table, using the
> CP we can get the index table region object corresponding to this main
> table region from the RS. There is no issue in creating the static part of
> the rowkey. You know 'x' is the region start key. Then at the server side
> we will create a scanner on the index region directly, and here we can specify
> the start key: 'x' + <timestamp value> + <customer id>. Using the results
> from the index scan we will reseek on the main region to the exact
> rows where the data we are interested in is available. So there won't
> be a full region data scan happening.
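The start key composition described above ('x' + timestamp value + customer id) might look like the following Python sketch. The byte encoding is an assumption, not something stated in the thread: the timestamp is packed as an 8-byte big-endian value so that byte-wise key order matches numeric order for non-negative timestamps, and `index_start_key` is a hypothetical helper name.

```python
import struct


def index_start_key(region_start_key: bytes, timestamp: int, customer_id: bytes) -> bytes:
    """Concatenate the region start key, a big-endian timestamp, and the customer id.

    Assumed encoding: 8-byte signed big-endian for the timestamp.
    """
    return region_start_key + struct.pack(">q", timestamp) + customer_id


key = index_start_key(b"x", 1355817600000, b"cust42")
print(key)

# Keys for later timestamps sort after keys for earlier ones in the same region.
assert index_start_key(b"x", 255, b"a") < index_start_key(b"x", 256, b"a")
```

Because the region start key is known at the server side, the static prefix never has to be supplied by the client.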

> > When in the cases where only timestamp is there but no customer id, it
Anil: I hope we are now on the same page. Thanks a lot for taking the time
to discuss this.
Thanks & Regards,
Anil Gupta