HBase, mail # user - Re: HBase - Secondary Index


Re: HBase - Secondary Index
Mohit Anchlia 2012-12-28, 03:42
On Thu, Dec 27, 2012 at 7:33 PM, Anoop Sam John <[EMAIL PROTECTED]> wrote:

> Yes, as you say, as the number of rows to be returned grows, the latency
> grows with it. Seeks within an HFile block are a somewhat expensive
> operation at the moment (not much, but still). The new prefix trie encoding
> will be a huge bonus here; seeks will fly with it. [Ted also presented
> this at Hadoop China.] Thanks to Matt... :) I am trying to measure the
> scan performance with this new encoding, and am trying to backport a
> simple patch to the 0.94 version just for testing. Yes, as the number of
> results to be returned grows, any index becomes less performant, as per
> my study :)
>
Do you have a link to that presentation?
> >btw, quick question: in your presentation, is the scale there seconds or
> >milliseconds? :)
>
> It is seconds. Don't consider the exact values. What matters is the %
> increase in latency :) Those were not high-end machines.
>
> -Anoop-
> ________________________________________
> From: Shengjie Min [[EMAIL PROTECTED]]
> Sent: Thursday, December 27, 2012 9:59 PM
> To: [EMAIL PROTECTED]
> Subject: Re: HBase - Secondary Index
>
> >Didn't follow you completely here. There won't be any get() happening.
> >Since we get the exact rowkey in a region from the index table, we can
> >seek to the exact position and return that row.
>
> Sorry, I misused "get()" here; I meant seeking. Yes, if only a small
> number of rows is returned, this works perfectly. As you said, you get
> the exact rowkey positions per region and simply seek to them. I was trying
> to work out the case where the number of result rows increases
> massively. In Anil's case, he wants to run a scan query against the
> secondary index (timestamp): "select all rows from timestamp1 to timestamp2"
> with no customerId provided. During that time period, he might have a big
> chunk of rows from different customerIds. The index table returns a lot of
> rowkey positions for different customerIds (I believe they are scattered
> across different regions), so you end up seeking to all those positions in
> different regions to return all the rows needed. According to your
> presentation, page 14 - Performance Test Results (Scan), without an index
> the time increases linearly as the number of result rows increases. With an
> index, on the other hand, the time spent climbs much faster than without one.
>
> btw, quick question: in your presentation, is the scale there seconds or
> milliseconds? :)
>
> - Shengjie
>
>
> On 27 December 2012 15:54, Anoop John <[EMAIL PROTECTED]> wrote:
>
> > >how the massive number of get() is going to
> > >perform against the main table
> >
> > Didn't follow you completely here. There won't be any get() happening.
> > Since we get the exact rowkey in a region from the index table, we can
> > seek to the exact position and return that row.
> >
> > -Anoop-
> >
> > On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[EMAIL PROTECTED]>
> > wrote:
> >
> > > how the massive number of get() is going to
> > > perform against the main table
> > >
> >
>
>
>
> --
> All the best,
> Shengjie Min
>
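
The access pattern discussed in this thread (one range scan over a
timestamp index, then one seek into the main table per index hit) can be
illustrated with a small in-memory model. This is only a sketch under
stated assumptions: SortedTable, index_key and query_by_time are
hypothetical names, the tables are plain sorted Python lists rather than
HBase regions or HFiles, and the seek counter only counts positioning
operations; it does not model their real cost.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose rows are stored sorted by rowkey."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.keys = sorted(self.rows)
        self.seeks = 0  # number of positioning operations performed

    def seek(self, rowkey):
        """Position at the first stored key >= rowkey and return its index."""
        self.seeks += 1
        return bisect.bisect_left(self.keys, rowkey)

    def scan(self, start, stop):
        """Yield (key, value) for start <= key < stop: one seek, then sequential reads."""
        i = self.seek(start)
        while i < len(self.keys) and self.keys[i] < stop:
            yield self.keys[i], self.rows[self.keys[i]]
            i += 1


def index_key(ts, rowkey):
    # Zero-padded timestamp so that lexicographic order matches numeric order.
    return f"{ts:010d}|{rowkey}"


def query_by_time(main, index, ts_from, ts_to):
    """Return main-table rows with ts_from <= ts < ts_to using the index."""
    results = []
    for _, main_key in index.scan(index_key(ts_from, ""), index_key(ts_to, "")):
        i = main.seek(main_key)  # one seek per index hit
        if i < len(main.keys) and main.keys[i] == main_key:
            results.append((main_key, main.rows[main_key]))
    return results


# Hypothetical data: 50 customers with 3 rows each, rowkey = customerId|orderId.
main_rows = {
    f"cust{c:03d}|order{o}": {"ts": 1000 + c * 10 + o}
    for c in range(50)
    for o in range(3)
}
index_rows = {index_key(v["ts"], k): k for k, v in main_rows.items()}

main = SortedTable(main_rows)
index = SortedTable(index_rows)
hits = query_by_time(main, index, 1100, 1200)
print(len(hits), "rows;", index.seeks, "index seek(s);", main.seeks, "main-table seeks")
```

In this model the index costs a single seek regardless of the range, while
the number of main-table seeks equals the number of matching rows, which is
the growth Shengjie is asking about when the timestamp range covers many
customerIds.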