HBase >> mail # user >> Re: HBase - Secondary Index


Re: HBase - Secondary Index
Hi Anoop,

For my use case, scans will never have a primary table rowkey range when
I query using the secondary index. IMHO, if I send the request to all the
RSs of the table, I am concerned about too many unnecessary RPCs across
the cluster for every single query based on the secondary index. Essentially,
every time it will look like a full table scan, but under the hood the CPs
will do the magic using the secondary table. Your solution works well when
a rowkey range on the primary table can be specified.
Unfortunately, I don't have the luxury of using a "primary table rowkey
range" for now. It seems I will have to stick with my current solution.
However, it's always good to have a healthy discussion on different
approaches. :)
PS: My current secondary index implementation is not yet in production. I
did some preliminary testing and it seems to work fine, but I think I need
to do some more testing.
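
To make the fan-out concern concrete, here is a toy Python sketch (not HBase
client code). It models a table's regions as key ranges and shows which
regions a scan has to contact with and without a rowkey range. The names
`Region`, `regions_for_scan`, the two-digit split keys, and the `rsN` server
names are all hypothetical; the figure of 50 regions is borrowed from the
example quoted further down in this thread.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    start_key: str  # inclusive; "" means start of table
    end_key: str    # exclusive; "" means end of table
    server: str     # hypothetical region server name


def regions_for_scan(regions: List[Region],
                     start_row: str = "",
                     stop_row: str = "") -> List[Region]:
    """Return the regions whose key range overlaps [start_row, stop_row).

    An empty start_row/stop_row means "unbounded", so a scan that carries
    only a filter and no rowkey range overlaps every region.
    """
    hits = []
    for r in regions:
        starts_before_stop = stop_row == "" or r.start_key < stop_row
        ends_after_start = r.end_key == "" or r.end_key > start_row
        if starts_before_stop and ends_after_start:
            hits.append(r)
    return hits


# Assumed layout: 50 regions split on two-digit prefixes "01" .. "49".
boundaries = [""] + [f"{i:02d}" for i in range(1, 50)] + [""]
regions = [Region(boundaries[i], boundaries[i + 1], f"rs{i}")
           for i in range(50)]

# Filter-only scan (no rowkey range): every region gets a request.
all_hits = regions_for_scan(regions)

# Scan with a rowkey range: only the overlapping regions get a request.
ranged_hits = regions_for_scan(regions, start_row="10", stop_row="12")

print(len(all_hits), [r.server for r in ranged_hits])
```

In this model the filter-only scan touches all 50 regions, while the ranged
scan touches only the two regions covering "10" to "12", which is the
difference being discussed here.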

Thanks,
Anil Gupta
On Tue, Dec 18, 2012 at 1:27 AM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Anil:
>     If the scan from the client side does not specify any rowkey range but
> only the filter condition, then yes, it will go to all the primary table
> regions for the scan. There it will first scan the index table region and
> seek to the exact rows in the main table region. If that region does not
> have any data at all matching the filter condition, the entire region will
> simply be skipped.
>
> In a normal scan too, the request goes only to specific regions if we can
> specify a rowkey range. It is the same in our secondary index case.
>
> Put simply, for the scan there is no change at all in what happens at the
> client side: it uses the meta data to find out which regions and RSs to
> contact, then contacts those regions one by one and gets the data from each
> region. The only difference is what happens at the server side. Without an
> index, all the data from all the HFiles will get fetched at the server side
> and the filter will get applied to every row. Only those rows which pass
> the filter will get back to the client side. With an index, when the
> scanning happens at the server side, the index data will get scanned first
> from the index region. This region will be in the same RS, so there are no
> extra RPCs. The data to be scanned from the index table will be limited; we
> can create the start key and stop key for that. Based on the result of the
> index scan, we will know the rowkeys where all the data we are interested
> in resides. So a reseek will happen to those rows and only those rows will
> be read. So the time spent at the server side scanning a region will be
> greatly reduced.
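
The server-side flow described in the paragraph above can be sketched as a
toy Python model of a single region (this is not the coprocessor code under
discussion). It assumes the index is a list sorted by (column value, rowkey)
and kept next to the region's rows; `scan_without_index`, `build_index`,
`scan_with_index`, and the sample data are hypothetical names and values
chosen for illustration.

```python
import bisect


def scan_without_index(rows, column, value):
    """Full region scan: every row is read and the filter is applied to it."""
    examined = 0
    matches = []
    for rowkey in sorted(rows):
        examined += 1
        if rows[rowkey].get(column) == value:
            matches.append(rowkey)
    return matches, examined


def build_index(rows, column):
    """Index entries sorted by (column value, rowkey), co-located with the region."""
    return sorted((cols[column], rowkey)
                  for rowkey, cols in rows.items() if column in cols)


def scan_with_index(rows, index, value):
    """Scan a bounded slice of the index, then read only the matching rows."""
    # Start key: the first index entry for this value.
    lo = bisect.bisect_left(index, (value, ""))
    # Stop key: value + "\0" is the smallest string greater than value.
    hi = bisect.bisect_left(index, (value + "\0", ""))
    matches = []
    examined = 0
    for _, rowkey in index[lo:hi]:
        _ = rows[rowkey]  # stands in for the reseek to the main-table row
        examined += 1
        matches.append(rowkey)
    return matches, examined


# Assumed sample region: 100 rows, 10 distinct customer ids.
rows = {f"row{i:03d}": {"customer_id": f"c{i % 10}"} for i in range(100)}
index = build_index(rows, "customer_id")

full_matches, full_examined = scan_without_index(rows, "customer_id", "c3")
idx_matches, idx_examined = scan_with_index(rows, index, "c3")

print(full_examined, idx_examined, full_matches == idx_matches)
```

Both scans return the same ten rowkeys, but the indexed scan reads 10
main-table rows instead of 100. The client still makes one call per region
either way; only the work done inside each region changes.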
>
> Yes, but there will still be calls from the client side to the RS for each
> region.
>
> Now I think it might be clear to you. The ppt that I shared says the same
> thing. It shows what is happening at the server side.
>
> -Anoop-
>
> ________________________________________
> From: anil gupta [[EMAIL PROTECTED]]
> Sent: Tuesday, December 18, 2012 1:58 PM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase - Secondary Index
>
> Hi Anoop,
>
> Please find my reply inline.
>
> Thanks,
> Anil Gupta
>
> On Sun, Dec 16, 2012 at 8:02 PM, Anoop Sam John <[EMAIL PROTECTED]>
> wrote:
>
> > Hi Anil
> >     During the scan, there is no need to fetch any index data to the
> > client side. So there is no need to create any scanner on the index
> > table at the client side. This happens at the server side.
> >
>
>
> >
> > For the scan on the main table with a condition on timestamp and customer
> > id, a scanner is to be created with Filters, as normal when there is no
> > secondary index. So this scan from the client will go through all the
> > regions in the main table.
>
>
> Anil: Do you mean that if the table is spread across 50 region servers in
> a 60-node cluster, then we need to send a scan request to all 50 RSs?
> Doesn't that sound expensive? IMHO you were not doing this in your

Thanks & Regards,
Anil Gupta