Re: Best technique for doing lookup with Secondary Index
>
> Now your main question is lookups, right?
> Now there are some more hooks in the scan flow called pre/postScannerOpen,
> pre/postScannerNext.
> Maybe you can try using them to do a lookup on the secondary table and
> then use those values and pass them to the main table's next().
>
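One possible reading of the hook-based suggestion quoted above, as a minimal
sketch only (not from the thread): a RegionObserver deployed on the index table
"B" whose postScannerNext resolves the stored table-A row keys into the actual
main-table rows before they reach the client. The table, family, and qualifier
names are made up, and the hook signature shown is the 0.92/0.94-era one, which
differs in later HBase versions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.InternalScanner;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexLookupObserver extends BaseRegionObserver {

  private static final byte[] MAIN_TABLE = Bytes.toBytes("A");
  private static final byte[] CF = Bytes.toBytes("cf");             // hypothetical family
  private static final byte[] MAIN_ROW = Bytes.toBytes("main_row"); // pointer column in "B"

  @Override
  public boolean postScannerNext(ObserverContext<RegionCoprocessorEnvironment> e,
      InternalScanner s, List<Result> results, int limit, boolean hasMore)
      throws IOException {
    if (results.isEmpty()) {
      return hasMore;
    }
    // Replace each index row with the full row fetched from the main table "A".
    HTableInterface mainTable = e.getEnvironment().getTable(MAIN_TABLE);
    try {
      List<Result> resolved = new ArrayList<Result>(results.size());
      for (Result indexRow : results) {
        byte[] mainRowKey = indexRow.getValue(CF, MAIN_ROW);
        if (mainRowKey != null) {
          resolved.add(mainTable.get(new Get(mainRowKey)));  // RPC if A's region is remote
        }
      }
      results.clear();
      results.addAll(resolved);
    } finally {
      mainTable.close();
    }
    return hasMore;
  }
}

The Get against "A" is still an RPC whenever A's region lives on another region
server, which is the extra hop discussed in the replies below.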

With a secondary index it is hard to avoid at least two RPC calls (one from the
client to table B and then from table B to table A), whether you use a coproc
or not. But I believe using a coproc is better than doing the RPC calls from
the client, since the client might be outside the subnet/network of the
cluster; in that case the RPCs will be faster when we use coprocs. In my case
the client is certainly not in the same subnet or network zone. I need to
return query results in around 100 milliseconds or less, so I need to be really
frugal. Let me know your views on this.

Have you implemented queries with secondary indexes using a coproc yet?
At present I have tried the client-side query and I can get the results of the
query in around 100 ms. I am tempted to try out the coproc implementation.
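For concreteness, the client-side version of the lookup presumably looks
something like the rough sketch below. This is not Anil's actual code; the
table, family, and qualifier names are assumptions.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientSideLookup {

  private static final byte[] CF = Bytes.toBytes("cf");             // hypothetical family
  private static final byte[] MAIN_ROW = Bytes.toBytes("main_row"); // pointer to table A

  // Scan the index table "B" over the <event_timestamp><customer_ID> range,
  // then multi-get the referenced rows from the main table "A".
  public static Result[] lookup(Configuration conf, byte[] startRow, byte[] stopRow)
      throws IOException {
    HTable indexTable = new HTable(conf, "B");
    HTable mainTable = new HTable(conf, "A");
    try {
      List<Get> gets = new ArrayList<Get>();
      ResultScanner scanner = indexTable.getScanner(new Scan(startRow, stopRow));
      try {
        for (Result indexRow : scanner) {
          gets.add(new Get(indexRow.getValue(CF, MAIN_ROW)));
        }
      } finally {
        scanner.close();
      }
      return mainTable.get(gets);   // second round trip, this time to table A
    } finally {
      indexTable.close();
      mainTable.close();
    }
  }
}

Both round trips here originate at the client, which is why moving the second
step into a coprocessor looks attractive when the client sits outside the
cluster's network.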

> But this may involve more RPC calls as your regions of "A" and "B" may be in
> different RS.
>
AFAIK, the RPC cannot be avoided even if region A and region B are on the same
RS, since the two regions belong to different tables. Am I right?
Thanks,
Anil Gupta

On Thu, Oct 25, 2012 at 9:20 PM, Ramkrishna.S.Vasudevan <
[EMAIL PROTECTED]> wrote:

> > Is it a
> > good idea to create an HTable instance on "B" and do the put in my mapper?
> > I might try this idea.
> Yes, you can do this. In the same mapper you can do a put for table "B".
> This is how we have tried loading data into another table by using the main
> table "A" puts.
>
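A rough sketch of the mapper-side put described above, under a few assumptions
not in the thread: the mapper feeds the usual bulk-load output for table "A",
the input is a hypothetical CSV of customerId,eventId,eventTimestamp,payload,
and the column family/qualifier names are invented.

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class EventIndexMapper
    extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

  private static final byte[] CF = Bytes.toBytes("cf");   // hypothetical family
  private HTable indexTable;

  @Override
  protected void setup(Context context) throws IOException {
    indexTable = new HTable(context.getConfiguration(), "B");
    indexTable.setAutoFlush(false);   // buffer the index puts
  }

  @Override
  protected void map(LongWritable key, Text line, Context context)
      throws IOException, InterruptedException {
    // Hypothetical input record: customerId,eventId,eventTimestamp,payload
    String[] f = line.toString().split(",", 4);
    byte[] mainRowKey  = Bytes.add(Bytes.toBytes(f[0]), Bytes.toBytes(f[1]));
    byte[] indexRowKey = Bytes.add(Bytes.toBytes(f[2]), Bytes.toBytes(f[0]));

    // Put for the main table "A" goes through the bulk-load output as usual.
    Put mainPut = new Put(mainRowKey);
    mainPut.add(CF, Bytes.toBytes("payload"), Bytes.toBytes(f[3]));
    context.write(new ImmutableBytesWritable(mainRowKey), mainPut);

    // Index row for "B": just a pointer back to the main-table row key.
    Put indexPut = new Put(indexRowKey);
    indexPut.add(CF, Bytes.toBytes("main_row"), mainRowKey);
    indexTable.put(indexPut);         // direct write to table "B"
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    indexTable.close();               // flushes any buffered puts
  }
}

Writing to "B" through a buffered HTable keeps the bulk load a single job, at
the cost of regular puts (and their RPCs) for the index table.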
> Now your main question is lookups, right?
> Now there are some more hooks in the scan flow called pre/postScannerOpen,
> pre/postScannerNext.
> Maybe you can try using them to do a lookup on the secondary table and
> then use those values and pass them to the main table's next().
> But this may involve more RPC calls as your regions of "A" and "B" may be in
> different RS.
>
> If something is wrong in my understanding of what you said, kindly spare
> me.
> :)
>
> Regards
> Ram
>
>
> > -----Original Message-----
> > From: anil gupta [mailto:[EMAIL PROTECTED]]
> > Sent: Friday, October 26, 2012 3:40 AM
> > To: [EMAIL PROTECTED]
> > Subject: Re: Best technique for doing lookup with Secondary Index
> >
> > Anoop: In the prePut hook you call HTable#put()?
> > Anil: Yes, I call HTable#put() in prePut. Is there a better way of doing
> > it?
> >
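For reference, a prePut-based index writer along the lines Anil describes might
look roughly like the sketch below. The family/qualifier names and the way the
index row key is composed are assumptions, and the hook signature shown is the
0.92/0.94-era one, which differs in later HBase versions.

import java.io.IOException;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
import org.apache.hadoop.hbase.coprocessor.ObserverContext;
import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
import org.apache.hadoop.hbase.util.Bytes;

public class IndexingObserver extends BaseRegionObserver {

  private static final byte[] INDEX_TABLE = Bytes.toBytes("B");
  private static final byte[] CF = Bytes.toBytes("cf");               // hypothetical
  private static final byte[] EVENT_TS = Bytes.toBytes("event_ts");   // hypothetical
  private static final byte[] MAIN_ROW = Bytes.toBytes("main_row");

  @Override
  public void prePut(ObserverContext<RegionCoprocessorEnvironment> e,
      Put put, WALEdit edit, boolean writeToWAL) throws IOException {
    List<KeyValue> kvs = put.get(CF, EVENT_TS);
    if (kvs.isEmpty()) {
      return;   // nothing to index for this put
    }
    // Index row key = <event_timestamp><customer_ID...>; here the timestamp is
    // simply prepended to the main-table row key for illustration.
    byte[] indexRow = Bytes.add(kvs.get(0).getValue(), put.getRow());
    Put indexPut = new Put(indexRow);
    indexPut.add(CF, MAIN_ROW, put.getRow());   // pointer back to table A
    HTableInterface indexTable = e.getEnvironment().getTable(INDEX_TABLE);
    try {
      indexTable.put(indexPut);                 // the extra RPC to table B
    } finally {
      indexTable.close();
    }
  }
}

Because the put to "B" is issued from the region server hosting "A", the extra
hop stays inside the cluster, which is the point Anil makes above about clients
sitting outside the subnet.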
> > Anoop: Why use the network calls from the server side here then?
> > Anil: I thought this was a cleaner approach since I am using the
> > BulkLoader. I decided not to run two jobs since I am generating a
> > UniqueIdentifier at runtime in the bulk loader.
> >
> > Anoop: Can you not handle it from the client alone?
> > Anil: I cannot handle it from the client since I am using the BulkLoader.
> > Is it a good idea to create an HTable instance on "B" and do the put in my
> > mapper? I might try this idea.
> >
> > Anoop: You can have a look at the Lily project.
> > Anil: It's a little late for us to evaluate Lily now, and at present we
> > don't need a complex secondary index since our data is immutable.
> >
> > Ram: What is rowkey B here?
> > Anil: Suppose I am storing customer events in table A. I have two
> > requirements for querying the data:
> > 1. Query customer events on the basis of customer_Id and event_ID.
> > 2. Query customer events on the basis of event_timestamp and customer_ID.
> >
> > 70% of the querying is done by query #1, so I will create
> > <customer_Id><event_ID> as the row key of table A.
> > Now, in order to support fast results for query #2, I need to create a
> > secondary index on A. I store that secondary index in B; the row key of B
> > is <event_timestamp><customer_ID>. Every row stores the corresponding row
> > key of A.
> >
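The row-key scheme described above could be expressed roughly as follows. This
is a sketch only: the thread does not say how the key components are encoded,
so fixed-width longs are assumed, and the column names are invented.

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeys {

  private static final byte[] CF = Bytes.toBytes("cf");
  private static final byte[] MAIN_ROW = Bytes.toBytes("main_row");

  // Table A row key: <customer_Id><event_ID>, serving query #1.
  static byte[] mainRowKey(long customerId, long eventId) {
    return Bytes.add(Bytes.toBytes(customerId), Bytes.toBytes(eventId));
  }

  // Table B row key: <event_timestamp><customer_ID>, serving query #2.
  static byte[] indexRowKey(long eventTimestamp, long customerId) {
    return Bytes.add(Bytes.toBytes(eventTimestamp), Bytes.toBytes(customerId));
  }

  // Each row of B stores the corresponding row key of A, as described above.
  static Put indexPut(long customerId, long eventId, long eventTimestamp) {
    Put p = new Put(indexRowKey(eventTimestamp, customerId));
    p.add(CF, MAIN_ROW, mainRowKey(customerId, eventId));
    return p;
  }
}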
> > Ram: How is the startRow determined for every query?
> > Anil: It's determined by very simple application logic.
> >
> > Thanks,
Thanks & Regards,
Anil Gupta