

RE: HBase - Secondary Index
Yes, as you say, latency grows as the number of rows to be returned grows. Seeking within an HFile block is a somewhat expensive operation right now (not by much, but still). The new prefix trie encoding will be a huge bonus here; seeks will fly with it. [Ted also presented this at Hadoop China.] Thanks to Matt... :) I am trying to measure scan performance with this new encoding, and trying to backport a simple patch to the 0.94 version just for testing... And yes, per my study, any index performs worse as the number of results to be returned grows :)

>btw, quick question: in your presentation, is the scale there seconds or
>milliseconds? :)

It is seconds. Don't consider the exact values; what matters is the percentage increase in latency :) Those were not high-end machines.

-Anoop-
________________________________________
From: Shengjie Min [[EMAIL PROTECTED]]
Sent: Thursday, December 27, 2012 9:59 PM
To: [EMAIL PROTECTED]
Subject: Re: HBase - Secondary Index

>Didn't follow you completely here. There won't be any get() happening. As we
>get the exact rowkey in a region from the index table, we can seek to the
>exact position and return that row.

Sorry, I misused "get()" here; I meant seeking. Yes, if only a small number
of rows is returned, this works perfectly. As you said, you get the exact
rowkey positions per region and simply seek to them. I was trying to work
out the case where the number of result rows increases massively. Like in
Anil's case, he wants to do a scan query against the secondary index
(timestamp): "select all rows from timestamp1 to timestamp2", with no
customerId provided. During that time period, he might have a big chunk of
rows from different customerIds. The index table returns a lot of rowkey
positions for different customerIds (I believe they are scattered across
different regions), so you end up seeking to all those positions in
different regions and returning all the rows needed. According to your
presentation, page 14 - Performance Test Results (Scan), without an index
the time increases linearly as the number of result rows increases. With an
index, on the other hand, the time spent climbs much faster than without
one.
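
To make the access pattern described above concrete, here is a minimal toy
sketch in Python. It is not HBase code and not the implementation from the
presentation. The row key layout ("<customerId>#<timestamp>"), the region
boundaries, and every name in it (MAIN_ROWS, REGION_STARTS, INDEX,
region_of, scan_by_timestamp) are assumptions made up for illustration. It
only shows that a timestamp range scan on the index yields one seek per
matching row, spread over every region that holds a matching customerId.

```python
import bisect

# Toy main table: row keys are "<customerId>#<timestamp>", kept sorted
# the way HBase keeps rows sorted by key. (Hypothetical layout.)
MAIN_ROWS = sorted(
    f"c{cust:02d}#{ts:04d}" for cust in range(1, 7) for ts in range(100, 200, 10)
)

# Hypothetical region start keys; region i holds keys >= REGION_STARTS[i]
# and below the next start key.
REGION_STARTS = ["", "c03", "c05"]

# Toy index table: sorted by timestamp, each entry points at a main row key.
INDEX = sorted((int(key.split("#")[1]), key) for key in MAIN_ROWS)


def region_of(row_key):
    """Return the index of the region whose key range contains row_key."""
    return bisect.bisect_right(REGION_STARTS, row_key) - 1


def scan_by_timestamp(ts_from, ts_to):
    """Range-scan the index for ts_from <= ts < ts_to, then seek each row."""
    lo = bisect.bisect_left(INDEX, (ts_from, ""))
    hi = bisect.bisect_left(INDEX, (ts_to, ""))
    rows, seeks, regions_touched = [], 0, set()
    for _, row_key in INDEX[lo:hi]:
        regions_touched.add(region_of(row_key))
        pos = bisect.bisect_left(MAIN_ROWS, row_key)  # one seek per index hit
        seeks += 1
        rows.append(MAIN_ROWS[pos])
    return rows, seeks, regions_touched


rows, seeks, regions = scan_by_timestamp(120, 150)
print(len(rows), seeks, sorted(regions))
```

In this toy setup the number of seeks equals the number of result rows, and
a time range that covers all customerIds touches every region, which is the
case I am worried about.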

btw, quick question: in your presentation, is the scale there seconds or
milliseconds? :)

- Shengjie
On 27 December 2012 15:54, Anoop John <[EMAIL PROTECTED]> wrote:

> > how the massive number of get() is going to
> > perform against the main table
>
> Didn't follow you completely here. There won't be any get() happening. As we
> get the exact rowkey in a region from the index table, we can seek to the
> exact position and return that row.
>
> -Anoop-
>
> On Thu, Dec 27, 2012 at 6:37 PM, Shengjie Min <[EMAIL PROTECTED]>
> wrote:
>
> > how the massive number of get() is going to
> > perform against the main table
> >
>

--
All the best,
Shengjie Min