Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Scan performance

Copy link to this message
Re: Scan performance
Have a look at FuzzyRowFilter


On Sat, Jun 22, 2013 at 9:20 AM, Tony Dean <[EMAIL PROTECTED]> wrote:

> I understand more, but have additional questions about the internals...
> So, in this example I have 6000 rows X 40 columns in this table.  In this
> test my startRow and stopRow do not narrow the scan criterior therefore all
> 6000x40 KVs must be included in the search and thus read from disk and into
> memory.
> The first filter that I used was:
> Filter f2 = new SingleColumnValueFilter(cf, qualifier,
>  CompareFilter.CompareOp.EQUALS, value);
> This means that HBase must look for the qualifier column on all 6000 rows.
>  As you mention I could add certain columns to a different cf; but
> unfortunately, in my case there is no such small set of columns that will
> need to be compared (filtered on).  I could try to use indexes so that a
> complete row key can be calculated from a secondary index in order to
> perform a faster search against data in a primary table.  This requires
> additional tables and maintenance that I would like to avoid.
> I did try a row key filter with regex hoping that it would limit the
> number of rows that were read from disk.
> Filter f2 = new RowFilter(CompareFilter.CompareOp.EQUAL, new
> RegexStringComparator(row_regexpr));
> My row keys are something like: vid,sid,event.  sid is not known at query
> time so I can use a regex similar to: vid,.*,Logon where Logon is the event
> that I am looking for in a particular visit.  In my test data this should
> have narrowed the scan to 1 row X 40 columns.  The best I could do for
> start/stop row is: vid,0 and vid,~ respectively.  I guess that is still
> going to cause all 6000 rows to be scanned, but the filtering should be
> more specific with the rowKey filter.  However, I did not see any
> performance improvement.  Anything obvious?
> Do you have any other ideas to help out with performance when row key is:
> vid,sid,event and sid is not known at query time which leaves a gap in the
> start/stop row?  Too bad regex can't be used in start/stop row
> specification.  That's really what I need.
> Thanks again.
> -Tony
> -----Original Message-----
> From: Vladimir Rodionov [mailto:[EMAIL PROTECTED]]
> Sent: Friday, June 21, 2013 8:00 PM
> To: [EMAIL PROTECTED]; lars hofhansl
> Subject: RE: Scan performance
> Lars,
> I thought that column family is the locality group and placement columns
> which are frequently accessed together into the same column family
> (locality group) is the obvious performance improvement tip. What are the
> "essential column families" for in this context?
> As for original question..  Unless you place your column into a separate
> column family in Table 2, you will need to scan (load from disk if not
> cached) ~ 40x more data for the second table (because you have 40 columns).
> This may explain why do  see such a difference in execution time if all
> data needs to be loaded first from HDFS.
> Best regards,
> Vladimir Rodionov
> Principal Platform Engineer
> Carrier IQ, www.carrieriq.com
> ________________________________________
> From: lars hofhansl [[EMAIL PROTECTED]]
> Sent: Friday, June 21, 2013 3:37 PM
> Subject: Re: Scan performance
> HBase is a key value (KV) store. Each column is stored in its own KV, a
> row is just a set of KVs that happen to have the row key (which is the
> first part of the key).
> I tried to summarize this here:
> http://hadoop-hbase.blogspot.de/2011/12/introduction-to-hbase.html)
> In the StoreFiles all KVs are sorted in row/column order, but HBase still
> needs to skip over many KVs in order to "reach" the next row. So more disk
> and memory IO is needed.
> If you using 0.94 there is a new feature "essential column families". If
> you always search by the same column you can place that one in its own
> column family and all other column in another column family. In that case
> your scan performance should be close identical.