Re: Scan performance
Tony:
Take a look at
http://blog.sematext.com/2012/08/09/consider-using-fuzzyrowfilter-when-in-need-for-secondary-indexes-in-hbase/
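
For context, here is a minimal sketch of the byte-mask matching idea behind FuzzyRowFilter, written in plain Java without the HBase client so it runs standalone. It assumes position2 has a fixed width of five bytes (as in "sid-0"), because the mask is positional. The names FuzzyMatchSketch, matches and bytes are hypothetical, and this is an illustration rather than the HBase implementation.

```java
import java.nio.charset.StandardCharsets;

public class FuzzyMatchSketch {

    // mask[i] == 0: byte i of the row must equal pattern[i] (fixed position).
    // mask[i] == 1: any byte is accepted at position i (fuzzy position).
    // Only the first pattern.length bytes of the row are inspected.
    public static boolean matches(byte[] row, byte[] pattern, byte[] mask) {
        if (row.length < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (mask[i] == 0 && row[i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    public static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "vid," and ",Logon" are fixed; the five '?' bytes are placeholders.
        byte[] pattern = bytes("vid,?????,Logon");
        byte[] mask = new byte[pattern.length]; // all zeros: every byte fixed
        for (int i = 4; i < 9; i++) {
            mask[i] = 1; // assumed fixed-width position2 occupies bytes 4..8
        }
        System.out.println(matches(bytes("vid,sid-0,Logon"), pattern, mask));  // true
        System.out.println(matches(bytes("vid,sid-0,Logoff"), pattern, mask)); // false
    }
}
```

This covers only the matching step. As I understand the linked post, the real filter also hands the scanner a hint for the next row key that could match, so the scan can seek past non-matching ranges instead of reading every row.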

Cheers

On Tue, Jul 2, 2013 at 2:31 PM, Tony Dean <[EMAIL PROTECTED]> wrote:

> The following information is what I discovered from Scan performance
> testing.
>
> Setup
> -------
> row key format:
> position1,position2,position3
> where position1 is a fixed literal, and position2 and position3 are
> variable data.
>
> I have created data with 6000 rows with ~40 columns in each row.  The
> table contains only 1 column family.
>
> The row that I want to query is:
> vid,sid-0,Logon    event:customer value=?
>
> -------
>
> Case 1:
> use fully qualified row specification in start/stop row key (e.g.,
> vid,sid-0,Logon) with a SingleColumnValueFilter in the Scan.
>
> avg response time to get Scan iterator and iterate the single result is
> ~5ms.  This is expected.
>
>
> Case 2:
> This is the normal case where position2 in the row key is unknown at the
> time of the query: vid,?,Logon.
> Using a SingleColumnValueFilter in the Scan, the avg response time to get
> Scan iterator and iterate the single result is ~100ms.
> This is the use case that I'm trying to improve upon.
>
> Case 3:
> After upgrading to 0.94.8 I was able to change Case2 by using
> FuzzyRowFilter instead of SingleColumnValueFilter.  It's a good candidate
> since I know position1 and position3.
> The avg response time to get Scan iterator and iterate the single result
> was ~5ms (pretty much the same response time as case 1 where I knew the
> complete row key).
>
> I didn't expect such an improvement.  Can you explain how FuzzyRowFilter
> optimizes scanning rows from disk?  In my case it needs to scan rows
> (vid,?,xxxx) until xxxx is greater than "Logon".  Then it can just stop
> after that; thereby optimizing the scan, correct?  So, optimization using
> FuzzyRowFilter is very dependent upon the data that you are scanning.
>
> Thanks for any insight.
>
>
> -----Original Message-----
> From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> Sent: Monday, June 24, 2013 5:05 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Scan performance
>
> RowFilter can help. It depends on the setup.
> RowFilter skips all columns of the row when the row key does not match.
> That will help with IO *if* your rows are larger than the HFile block size
> (64k by default). Otherwise it still needs to touch each block.
>
> An HTable does some priming when it is created. The region information for
> all tables could be substantial, so it does not make much sense to prime
> the cache for all tables.
> How are you using the client? If you pre-create and reuse an HTable and/or
> HConnection you should be OK.
>
>
> -- Lars
>
>
>
> ________________________________
>  From: Tony Dean <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> [EMAIL PROTECTED]>
> Sent: Monday, June 24, 2013 1:48 PM
> Subject: RE: Scan performance
>
>
> Lars,
> I'm waiting for a window to swap the HBase jars in the cluster for ones
> that support FuzzyRowFilter so that I can try it out.  In the meantime, I'm
> wondering why RowFilter with a regex is not more helpful.  I'm guessing that
> FuzzyRowFilter helps with disk IO while RowFilter just filters after the
> disk IO has completed.  Also, I turned on the row-level bloom filter, which
> does not seem to help either.
>
> On a different performance note, I'm wondering if there is a way to prime
> client connection information and such so that the first client query isn't
> miserably slow.  After the first query, response times do get considerably
> better due to caching necessary information.  Is there a way to get around
> this first initial hit?  I assume any such priming would have to be
> application specific.
>
> Thanks.
>
> -----Original Message-----
> From: lars hofhansl [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, June 22, 2013 9:24 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Scan performance
>
> "essential column families" help when you filter on one column but want to