HBase, mail # user - Re: HBase - Secondary Index


anil gupta 2012-12-19, 08:24
Hi Anoop,

For my use case, scans will never have a primary table rowkey range when
I query using the secondary index. IMHO, if I send the request to all the
RSs of the table, I am concerned about too many unnecessary RPCs across
the cluster for every single query based on the secondary index.
Essentially, every query will look like a full table scan, but under the
hood the CPs will do the magic using the secondary table. Your solution
works well when a rowkey range on the primary table can be specified.
Unfortunately, I don't have the luxury of using a "primary table rowkey
range" for now. It seems I will have to stick to my current solution.
However, it's always good to have a healthy discussion on different
approaches. :)
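
To make the RPC concern concrete, here is a toy sketch (plain Python, not
the HBase client API) of how many regions a scan has to contact with and
without a rowkey range. The function name, the region layout, and the
50-region table are assumptions made up for illustration.

```python
from bisect import bisect_left, bisect_right


def regions_for_scan(region_start_keys, start_row=None, stop_row=None):
    """Return indexes of the regions a scan over [start_row, stop_row) contacts.

    region_start_keys is sorted; region i covers
    [region_start_keys[i], region_start_keys[i + 1]) and the last region is
    open-ended. None means the scan is unbounded on that side.
    """
    if start_row is None:
        first = 0
    else:
        # Region that contains start_row.
        first = max(bisect_right(region_start_keys, start_row) - 1, 0)
    if stop_row is None:
        last = len(region_start_keys) - 1
    else:
        # stop_row is exclusive, so a region starting at stop_row is skipped.
        last = bisect_left(region_start_keys, stop_row) - 1
    return list(range(first, last + 1))


# Hypothetical table split into 50 regions.
start_keys = [""] + [f"row{i:02d}" for i in range(1, 50)]

all_regions = regions_for_scan(start_keys)                    # filter only
few_regions = regions_for_scan(start_keys, "row10", "row12")  # with a range

print(len(all_regions), "regions contacted without a rowkey range")
print(len(few_regions), "regions contacted with a rowkey range")
```

Without a range the client has to visit every region, which is the
per-query fan-out I am worried about.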
PS: My current secondary index implementation is not yet in production. I
did some preliminary testing and it seems to work fine, but I think I need
to do some more testing.

Thanks,
Anil Gupta
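
The server-side flow described in the quoted message below (scan the
co-located index region between a start key and stop key, then seek only
to the matching rows in the main region) can be sketched roughly as
follows. This is a toy in-memory model in plain Python, not HBase or
coprocessor code; the Region class, the customer_id column, and the
(value, rowkey) index layout are assumptions for illustration only.

```python
from bisect import bisect_left


class Region:
    """Toy model of one main-table region plus its co-located index region."""

    def __init__(self, rows):
        # rows: dict mapping rowkey -> dict of column -> value
        self.rows = rows
        self.keys = sorted(rows)
        # Assumed index layout: entries sorted by (indexed value, main rowkey).
        self.index = sorted((cols["customer_id"], key) for key, cols in rows.items())

    def scan_with_filter(self, customer_id):
        """No index: read every row and apply the filter to each one."""
        hits, rows_read = [], 0
        for key in self.keys:
            rows_read += 1
            if self.rows[key]["customer_id"] == customer_id:
                hits.append(key)
        return hits, rows_read

    def scan_with_index(self, customer_id):
        """With index: bounded index scan, then read only the matching rows."""
        # Start key for the index scan; "" sorts before any real rowkey.
        lo = bisect_left(self.index, (customer_id, ""))
        hits, rows_read = [], 0
        for value, key in self.index[lo:]:
            if value != customer_id:
                break  # past the stop key of the index scan
            rows_read += 1  # one seek into the main region
            hits.append(key)
        return hits, rows_read


region = Region({
    "r1": {"customer_id": "c7"},
    "r2": {"customer_id": "c3"},
    "r3": {"customer_id": "c7"},
    "r4": {"customer_id": "c1"},
    "r5": {"customer_id": "c3"},
    "r6": {"customer_id": "c9"},
})

print(region.scan_with_filter("c7"))
print(region.scan_with_index("c7"))
print(region.scan_with_index("c5"))  # no match: main region is never read
```

Both paths return the same rowkeys; the difference is how many main-table
rows each one has to read inside the region.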
On Tue, Dec 18, 2012 at 1:27 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Anil:
>     If the scan from the client side does not specify any rowkey range but
> only the filter condition, then yes, it will go to all the primary table
> regions. In each region, it will first scan the index table region and
> then seek to the exact rows in the main table region. If that region has
> no data matching the filter condition, the entire region is simply
> skipped.
>
> In a normal scan too, if we can specify a rowkey range, the request goes
> only to the specific regions. It is the same in our secondary index case.
>
> Put simply, for the scan there is no change at all in what happens at
> the client side: it uses the meta data to find which regions and RSs to
> contact, then contacts those regions one by one and gets data from each.
> The only difference is what happens at the server side. Without an index,
> all the data from all the HFiles is fetched at the server side and the
> filter is applied to every row. Only those rows which pass the filter are
> returned to the client. With an index, when the scan happens at the
> server side, the index data is scanned first from the index region.
> This region is in the same RS, so there are no extra RPCs. The data to be
> scanned from the index table is limited; we can create the start key and
> stop key for that. Based on the result of the index scan, we know the
> rowkeys where the data we are interested in resides. So a reseek happens
> to those rows and only those rows are read. As a result, the time spent
> at the server side scanning a region is reduced considerably.
>
> Yes, but there will still be calls from the client side to the RS for
> each region...
>
> I hope that is clear now. The ppt that I shared says the same thing; it
> shows what happens at the server side.
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [[EMAIL PROTECTED]]
> Sent: Tuesday, December 18, 2012 1:58 PM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Anil
> >     During the scan, there is no need to fetch any index data to the
> > client side. So there is no need to create any scanner on the index
> > table at the client side. This happens at the server side.
> >
>
>
> >
> > For the Scan on the main table with a condition on timestamp and
> > customer id, a scanner is created with Filters, just as it would be
> > when there is no secondary index. So this scan from the client will go
> > through all the regions in the main table.
>
>
> Anil: Do you mean that if the table is spread across 50 region servers in
> a 60-node cluster, then we need to send a scan request to all 50 RSs,
> right? Doesn't that sound expensive? IMHO you were not doing this in your

Thanks & Regards,
Anil Gupta