HBase >> mail # user >> Scan performance


Re: Scan performance
Hi Tony,
Have you had a look at Phoenix (https://github.com/forcedotcom/phoenix), a SQL skin over HBase? It has a skip scan that lets you model a multi-part row key and skip through it efficiently, as you've described. Take a look at this blog post for more info: http://phoenix-hbase.blogspot.com/2013/05/demystifying-skip-scan-in-phoenix.html?m=1
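
To make the idea concrete, here is a minimal sketch of a skip scan in plain Java. It is only an illustration, not Phoenix or HBase code: a TreeSet of strings stands in for HBase's sorted row keys, the names SkipScanSketch, skipScan and sampleRows are hypothetical, and it assumes keys of the form vid,sid,event with no character above ',' ordering issues beyond plain string comparison. Instead of testing every row in the vid range, it seeks to the wanted event within each sid and then jumps past the rest of that sid.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Illustration only: a TreeSet<String> stands in for HBase's sorted row keys.
// None of these names are part of the Phoenix or HBase API.
class SkipScanSketch {
    // Number of seeks performed by the most recent skipScan call.
    static int seeks;

    // Returns every key of the form "vid,<any sid>,event" by seeking from
    // sid to sid instead of examining each row in the vid range.
    static List<String> skipScan(TreeSet<String> rows, String vid, String event) {
        seeks = 0;
        List<String> hits = new ArrayList<>();
        String prefix = vid + ",";
        // Seek to the first row belonging to this vid.
        String row = rows.ceiling(prefix);
        seeks++;
        while (row != null && row.startsWith(prefix)) {
            int sidEnd = row.indexOf(',', prefix.length());
            if (sidEnd < 0) {
                // Key without an event part: step over it.
                row = rows.higher(row);
                seeks++;
                continue;
            }
            String sidPrefix = row.substring(0, sidEnd + 1); // "vid,sid,"
            // Point lookup for the wanted event within this sid.
            String target = sidPrefix + event;
            if (rows.contains(target)) {
                hits.add(target);
            }
            seeks++;
            // Seek hint: jump past all remaining rows of this sid.
            row = rows.higher(sidPrefix + Character.MAX_VALUE);
            seeks++;
        }
        return hits;
    }

    // Small made-up data set: three sids under v1, one of them without a Logon.
    static TreeSet<String> sampleRows() {
        return new TreeSet<>(Arrays.asList(
            "v0,s1,Logon",
            "v1,s1,Click", "v1,s1,Logoff", "v1,s1,Logon",
            "v1,s2,Click", "v1,s2,Logoff",
            "v1,s3,Click", "v1,s3,Logon",
            "v2,s1,Logon"));
    }

    public static void main(String[] args) {
        List<String> hits = skipScan(sampleRows(), "v1", "Logon");
        System.out.println(hits + " found with " + seeks + " seeks");
    }
}
```

The number of seeks grows with the number of distinct sids under a vid, not with the number of rows, which is where the saving comes from when each sid has many events.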

Regards,
James

On Jun 22, 2013, at 6:29 AM, "lars hofhansl" <[EMAIL PROTECTED]> wrote:

> Yep, generally you should design your keys such that start/stopKey can efficiently narrow the scope.
>
> If that really cannot be done (and you should try hard), the second-best option is "skip scans".
>
> Filters in HBase allow for providing the scanner framework with hints about where to go next.
> They can skip to the next column (to avoid looking at many versions), to the next row (to avoid looking at many columns), or they can provide a custom seek hint to a specific key value. The latter is what FuzzyRowFilter does.
>
>
> -- Lars
>
>
>
> ________________________________
> From: Anoop John <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, June 21, 2013 11:58 PM
> Subject: Re: Scan performance
>
>
> Have a look at FuzzyRowFilter
>
> -Anoop-
>
> On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[EMAIL PROTECTED]> wrote:
>
>> I understand more, but have additional questions about the internals...
>>
>> So, in this example I have 6000 rows X 40 columns in this table.  In this
>> test my startRow and stopRow do not narrow the scan criteria, so all
>> 6000x40 KVs must be included in the search and thus read from disk and into
>> memory.
>>
>> The first filter that I used was:
>> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>> CompareFilter.CompareOp.EQUAL, value);
>>
>> This means that HBase must look for the qualifier column on all 6000 rows.
>> As you mention I could add certain columns to a different cf; but
>> unfortunately, in my case there is no such small set of columns that will
>> need to be compared (filtered on).  I could try to use indexes so that a
>> complete row key can be calculated from a secondary index in order to
>> perform a faster search against data in a primary table.  This requires
>> additional tables and maintenance that I would like to avoid.
>>
>> I did try a row key filter with regex hoping that it would limit the
>> number of rows that were read from disk.
>> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
>> RegexStringComparator(row_regexpr));
>>
>> My row keys are something like: vid,sid,event.  sid is not known at query
>> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
>> that I am looking for in a particular visit.  In my test data this should
>> have narrowed the scan to 1 row X 40 columns.  The best I could do for
>> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
>> going to cause all 6000 rows to be scanned, but the filtering should be
>> more specific with the rowKey filter.  However, I did not see any
>> performance improvement.  Anything obvious?
>>
>> Do you have any other ideas to help out with performance when row key is:
>> vid,sid,event and sid is not known at query time which leaves a gap in the
>> start/stop row?  Too bad regex can't be used in start/stop row
>> specification.  That's really what I need.
>>
>> Thanks again.
>> -Tony
>>
>> -----Original Message-----
>> From: Vladimir Rodionov [mailto:[EMAIL PROTECTED]]
>> Sent: Friday, June 21, 2013 8:00 PM
>> To: [EMAIL PROTECTED]; lars hofhansl
>> Subject: RE: Scan performance
>>
>> Lars,
>> I thought that a column family is the locality group, and that placing columns
>> which are frequently accessed together into the same column family
>> (locality group) is the obvious performance improvement tip. What are the
>> "essential column families" for in this context?
>>
>> As for the original question: unless you place your column into a separate
>> column family in Table 2, you will need to scan (load from disk if not