Re: Scan performance
Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? It has a skip scan that lets you model a multi-part row key and skip through it efficiently, as you've described. Take a look at this blog post for more info: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1
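
For illustration only, a rough sketch of what that might look like through Phoenix's JDBC driver (the table, column names, and connection string are made up for the example, not taken from this thread):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SkipScanSketch {
    public static void main(String[] args) throws Exception {
        // Phoenix connects via JDBC; "localhost" stands in for the ZK quorum.
        Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost");
        // With a composite primary key (VID, SID, EVENT), pinning the leading
        // and trailing parts lets the skip scan seek past the unconstrained
        // SID ranges instead of reading through them.
        PreparedStatement stmt = conn.prepareStatement(
                "SELECT * FROM EVENTS WHERE VID = ? AND EVENT = ?");
        stmt.setString(1, "some-vid");
        stmt.setString(2, "Logon");
        ResultSet rs = stmt.executeQuery();
        while (rs.next()) {
            System.out.println(rs.getString("SID"));
        }
        conn.close();
    }
}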

Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <[EMAIL PROTECTED]> wrote:

> Yep, generally you should design your keys such that start/stopKey can efficiently narrow the scope.
>
> If that really cannot be done (and you should try hard), the second-best option is a "skip scan".
>
> Filters in HBase allow for providing the scanner framework with hints about where to go next.
> They can skip to the next column (to avoid looking at many versions), to the next row (to avoid looking at many columns), or they can provide a custom seek hint to a specific key value. The latter is what FuzzyRowFilter does.
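>
> For illustration, a minimal sketch of the FuzzyRowFilter approach (this
> assumes fixed-width key parts, which FuzzyRowFilter needs; the 4- and
> 8-byte widths below are made up for the example):
>
> import java.util.Arrays;
> import org.apache.hadoop.hbase.client.Scan;
> import org.apache.hadoop.hbase.filter.FuzzyRowFilter;
> import org.apache.hadoop.hbase.util.Bytes;
> import org.apache.hadoop.hbase.util.Pair;
>
> // Row key layout (fixed width): 4-byte vid | 4-byte sid | 8-byte event.
> byte[] rowTemplate = new byte[16];
> System.arraycopy(Bytes.toBytes(1234), 0, rowTemplate, 0, 4);        // known vid
> // bytes 4..7 (sid) stay zeroed -- they are marked fuzzy in the mask
> System.arraycopy(Bytes.toBytes("Logon   "), 0, rowTemplate, 8, 8);  // known event, padded
>
> // Mask semantics: 0 = byte must match the template, 1 = byte may be anything.
> byte[] mask = new byte[16];
> Arrays.fill(mask, 4, 8, (byte) 1);  // sid unknown at query time
>
> Scan scan = new Scan();
> scan.setFilter(new FuzzyRowFilter(
>     Arrays.asList(new Pair<byte[], byte[]>(rowTemplate, mask))));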
>
>
> -- Lars
>
>
>
> ________________________________
> From: Anoop John <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
>
>
> Have a look at FuzzyRowFilter
>
> -Anoop-
>
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[EMAIL PROTECTED]> wrote:
>
>> I understand more, but have additional questions about the internals...
>>
>> So, in this example I have 6000 rows X 40 columns in this table.  In this
>> test my startRow and stopRow do not narrow the scan criteria; therefore all
>> 6000x40 KVs must be included in the search and thus read from disk and into
>> memory.
>>
>> The first filter that I used was:
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>>         CompareFilter.CompareOp.EQUAL, value);
>>
>> This means that HBase must look for the qualifier column on all 6000 rows.
>> As you mention I could add certain columns to a different cf; but
>> unfortunately, in my case there is no such small set of columns that will
>> need to be compared (filtered on).  I could try to use indexes so that a
>> complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table.  This requires
>> additional tables and maintenance that I would like to avoid.
>>
>> I did try a row key filter with a regex, hoping that it would limit the
>> number of rows that were read from disk.
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
>> RegexStringComparator(row_regexpr));
>>
>> My row keys are something like: vid,sid,event.  sid is not known at query
>> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
>> that I am looking for in a particular visit.  In my test data this should
>> have narrowed the scan to 1 row X 40 columns.  The best I could do for
>> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
>> going to cause all 6000 rows to be scanned, but the filtering should be
>> more specific with the rowKey filter.  However, I did not see any
>> performance improvement.  Anything obvious?
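>>
>> For reference, a sketch of that scan (string row keys assumed, matching
>> the examples above; "vid" stands in for the actual visit id):
>>
>> import org.apache.hadoop.hbase.client.Scan;
>> import org.apache.hadoop.hbase.filter.CompareFilter;
>> import org.apache.hadoop.hbase.filter.RegexStringComparator;
>> import org.apache.hadoop.hbase.filter.RowFilter;
>> import org.apache.hadoop.hbase.util.Bytes;
>>
>> Scan scan = new Scan();
>> // Narrow the key range as far as the known prefix allows...
>> scan.setStartRow(Bytes.toBytes("vid,0"));
>> scan.setStopRow(Bytes.toBytes("vid,~"));
>> // ...then filter the remaining rows on the full key pattern.
>> scan.setFilter(new RowFilter(CompareFilter.CompareOp.EQUAL,
>>         new RegexStringComparator("vid,.*,Logon")));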
>>
>> Do you have any other ideas to help out with performance when row key is:
>> vid,sid,event and sid is not known at query time which leaves a gap in the
>> start/stop row?  Too bad regex can't be used in start/stop row
>> specification.  That's really what I need.
>>
>> Thanks again.
>> -Tony
>>
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: [EMAIL PROTECTED]; lars hofhansl
>> Subject: RE: Scan performance
>>
>> Lars,
>> I thought that a column family is the locality group, and that placing
>> columns which are frequently accessed together into the same column family
>> (locality group) is the obvious performance improvement tip.  What are the
>> "essential column families" for in this context?
>>
>> As for the original question...  Unless you place your column into a separate
>> column family in Table 2, you will need to scan (load from disk if not