NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
HBase >> mail # user >> Scan performance


Re: Scan performance
Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <[EMAIL PROTECTED]> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row.  The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon    event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result is
> ~5ms.  This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at the
> time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to get
> Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter.  It's a good candidate
> since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single result
> was ~5ms (pretty much the same response time as case 1 where I knew the
> complete row key).
>
> I didn't expect such an improvement.  Can you explain how FuzzyRowFilter
> optimizes scanning rows from disk?  In my case it needs to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon".  Then it can just stop
> after that; thereby optimizing the scan, correct?  So, optimization using
> FuzzyRowFilter is very dependent upon the data that you are scanning.
>
> Thanks for any insight.
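
On the "how does it optimize" question: as far as I understand it, FuzzyRowFilter does not stop once it passes "Logon". After each row that fails the fuzzy rule it hands the scanner a hint for the smallest possible key that could still match, and the scanner seeks there instead of reading the rows in between. Below is a minimal, self-contained sketch of that idea in plain Java. It does not use the HBase API. The '?' wildcard notation, the class and method names, the fixed-width sid-NNNN format for position2, and the ten sample event names are all assumptions made for illustration (the real filter takes a byte[] key plus a byte[] mask and needs the variable part to sit at fixed byte positions).

```java
import java.util.TreeSet;

// Sketch only: simulates "seek to next candidate key" over sorted, fixed-length row keys.
class FuzzySkipSketch {
    static final char ANY = '?'; // wildcard marker, a convention of this sketch

    /** True when every fixed position of the pattern equals the row's character. */
    static boolean matches(String row, String pattern) {
        if (row.length() != pattern.length()) return false;
        for (int i = 0; i < pattern.length(); i++) {
            char p = pattern.charAt(i);
            if (p != ANY && p != row.charAt(i)) return false;
        }
        return true;
    }

    /** Smallest key greater than row that satisfies the pattern, or null if there is none. */
    static String nextCandidate(String row, String pattern) {
        char[] result = pattern.toCharArray();
        for (int i = 0; i < result.length; i++) {
            if (result[i] == ANY) result[i] = Character.MIN_VALUE; // smallest filler
        }
        int toInc = -1;            // right-most wildcard position that can still grow
        boolean increased = false; // did a fixed position already push us past row?
        for (int i = 0; i < result.length; i++) {
            char r = row.charAt(i);
            char p = pattern.charAt(i);
            if (p == ANY) {
                result[i] = r;
                if (r < Character.MAX_VALUE) toInc = i;
            } else if (r < p) {
                increased = true;  // fixed value is larger, rest stays minimal
                break;
            } else if (r > p) {
                break;             // must bump an earlier wildcard instead
            }
        }
        if (!increased) {
            if (toInc < 0) return null;
            result[toInc]++;
            for (int i = toInc + 1; i < result.length; i++) {
                if (pattern.charAt(i) == ANY) result[i] = Character.MIN_VALUE;
            }
        }
        return new String(result);
    }

    /** Walks the sorted keys, jumping after every miss. Returns {examined, matched}. */
    static int[] scan(TreeSet<String> rows, String pattern) {
        int examined = 0;
        int matched = 0;
        String row = rows.isEmpty() ? null : rows.first();
        while (row != null) {
            examined++;
            if (matches(row, pattern)) {
                matched++;
                row = rows.higher(row);
            } else {
                String hint = nextCandidate(row, pattern);
                row = (hint == null) ? null : rows.ceiling(hint); // the "seek"
            }
        }
        return new int[] {examined, matched};
    }

    /** 100 hypothetical sids x 10 hypothetical event names = 1000 sorted keys. */
    static TreeSet<String> sampleRows() {
        String[] events = {"Abort", "Click", "Close", "Error", "Logon",
                           "Query", "Start", "Timer", "Visit", "Write"};
        TreeSet<String> rows = new TreeSet<>();
        for (int sid = 0; sid < 100; sid++) {
            for (String e : events) rows.add(String.format("vid,sid-%04d,%s", sid, e));
        }
        return rows;
    }

    public static void main(String[] args) {
        TreeSet<String> rows = sampleRows();
        int[] r = scan(rows, "vid,????????,Logon");
        System.out.println(rows.size() + " keys, examined " + r[0] + ", matched " + r[1]);
    }
}
```

With this sample data the loop examines 210 of the 1000 keys to find the 100 Logon rows. So the gain does depend on the data, but on how many rows sit between consecutive candidates rather than on where "Logon" sorts. In a real region server each seek still has a cost, so treat the count as an illustration, not a timing prediction.
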
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 24, 2013 5:05 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block size
> (64k by default). Otherwise it still needs to touch each block.
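
A back-of-the-envelope check of the block-size point against the test table from this thread (6000 rows, ~40 columns each). The 100 bytes per cell is purely an assumed figure, and the estimate ignores per-cell key overhead, compression and block encoding, so take it as a rough sketch only.

```java
// Sketch only: estimates how many rows share one HFile block.
class BlockTouchEstimate {
    /** Approximate number of whole rows that fit in one block (at least 1). */
    static long rowsPerBlock(long blockSizeBytes, long avgRowBytes) {
        return Math.max(1, blockSizeBytes / avgRowBytes);
    }

    /** Approximate number of blocks needed to hold rowCount rows (rounded up). */
    static long blocksForTable(long rowCount, long blockSizeBytes, long avgRowBytes) {
        long totalBytes = rowCount * avgRowBytes;
        return (totalBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long blockSize = 64 * 1024;   // default HFile block size mentioned above
        long columnsPerRow = 40;      // from the thread
        long assumedCellBytes = 100;  // ASSUMPTION, not from the thread
        long rowBytes = columnsPerRow * assumedCellBytes;

        System.out.println("rows per block:  " + rowsPerBlock(blockSize, rowBytes));
        System.out.println("blocks in table: " + blocksForTable(6000, blockSize, rowBytes));
    }
}
```

Under that assumption a row is about 4000 bytes, so roughly 16 rows share each block and the table spans about 367 blocks. Skipping the columns of a non-matching row then saves no block reads, because its neighbours in the same block still have to be visited.
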
>
> An HTable does some priming when it is created. The region information for
> all tables could be substantial, so it does not make much sense to prime
> the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or
> HConnection you should be OK.
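
The "pre-create and reuse" advice can be sketched generically as follows. This is not HBase code: SharedClient and the counting stand-in factory are hypothetical names, and in a real application the factory would build whatever connection or table handle your HBase version recommends sharing.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Sketch only: create an expensive client once, reuse it for every query.
class SharedClient<T> {
    private final Supplier<T> factory;
    private volatile T instance;

    SharedClient(Supplier<T> factory) {
        this.factory = factory;
    }

    /** Returns the single shared instance, creating it on first use. */
    T get() {
        T local = instance;
        if (local == null) {
            synchronized (this) {
                local = instance;
                if (local == null) {
                    local = factory.get(); // the expensive setup happens here, once
                    instance = local;
                }
            }
        }
        return local;
    }

    public static void main(String[] args) {
        AtomicInteger setups = new AtomicInteger();
        // Stand-in for an expensive client; counts how often setup runs.
        SharedClient<Object> shared = new SharedClient<>(() -> {
            setups.incrementAndGet();
            return new Object();
        });

        shared.get(); // call once at application startup to pay the setup cost early
        shared.get(); // later queries reuse the same instance
        System.out.println("setups performed: " + setups.get());
    }
}
```

Calling get() during startup, optionally followed by one throwaway query against the table, moves the slow first hit out of the first user request. That part is application specific, as Tony guessed.
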
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Tony Dean <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> [EMAIL PROTECTED]>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for a window to swap the HBase jars in the cluster for ones
> that support FuzzyRowFilter so I can try it out.  In the meantime, I'm
> wondering why a RowFilter with a regex is not more helpful.  I'm guessing
> that FuzzyRowFilter helps with disk IO while RowFilter just filters after
> the disk IO has completed.  Also, I turned on the row-level bloom filter,
> which does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to prime
> client connection information and such so that the first client query isn't
> miserably slow.  After the first query, response times do get considerably
> better due to caching necessary information.  Is there a way to get around
> this first initial hit?  I assume any such priming would have to be
> application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but want to