Re: Scanner with explicit columns list is very slow
Vladimir Rodionov 2013-10-15, 05:28
Yes, I load data into HRegion (with CACHE_ON_WRITE), then call flushcache()
(no data in memstore).
This is what I found: the default implementation of ExplicitColumnMatcher
is (possibly) tuned for very large rows, and I do mean very large. We need a
hint on the Scan that tells StoreScanner which strategy to use:
1. ExplicitColumnMatcher with reseeks (what we have currently) for very
large rows
Or, for small/medium rows:
2. Remove the explicit columns/families from the Scan and replace them with
an additional filter that keeps the Scan's columnFamilyMap and
verifies that every KV matches this map.
I have created such a filter (ExplicitScanReplacementFilter) and verified
that it works much better than option 1 for small rows. For 1 CF + 5 CQs and
a Scan with 2 CQs I get:
400K rows per sec with the default
1.25M rows per sec with ExplicitScanReplacementFilter
I will optimize ExplicitScanReplacementFilter further and will probably
reach 1.4-1.5M rows per sec tomorrow.
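
To make option 2 concrete, here is a minimal sketch of the per-KV check such a filter would perform. It is a plain-Java model, not the actual ExplicitScanReplacementFilter and not HBase code: the class name ColumnMapFilterSketch and the include() method are hypothetical, Strings stand in for byte[] family/qualifier names, and a null qualifier set is assumed to mean "whole family requested".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical plain-Java model of the option 2 check (not HBase code).
class ColumnMapFilterSketch {
    // family -> requested qualifiers; a null value means the whole family.
    private final Map<String, Set<String>> familyMap;

    ColumnMapFilterSketch(Map<String, Set<String>> familyMap) {
        this.familyMap = familyMap;
    }

    // Returns true if a KV with this family/qualifier was requested.
    // Every KV is checked in sequence; no reseek is issued.
    boolean include(String family, String qualifier) {
        if (!familyMap.containsKey(family)) {
            return false; // family not requested at all
        }
        Set<String> qualifiers = familyMap.get(family);
        return qualifiers == null || qualifiers.contains(qualifier);
    }

    public static void main(String[] args) {
        // Same shape as the test: 1 CF with 5 CQs, 2 CQs requested.
        Map<String, Set<String>> requested = new HashMap<>();
        requested.put("CF", new HashSet<>(Arrays.asList("CQ1", "CQ3")));
        ColumnMapFilterSketch filter = new ColumnMapFilterSketch(requested);

        for (String cq : new String[] {"CQ1", "CQ2", "CQ3", "CQ4", "CQ5"}) {
            String decision = filter.include("CF", cq) ? "INCLUDE" : "SKIP";
            System.out.println(cq + " -> " + decision);
        }
    }
}
```

With 5 small columns per row, the scanner simply walks all 5 KVs and keeps CQ1 and CQ3, instead of building a fake key and reseeking after each match.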
We need a JIRA and I will open one tomorrow.
On Mon, Oct 14, 2013 at 9:38 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Interesting. Thanks for doing the testing/profiling Vladimir!
> Generally reseeks are better if they can skip many KVs.
> For example if you have many versions of the same row/col,
> INCLUDE_NEXT_COL will be better than issuing many INCLUDEs, same with
> INCLUDE_NEXT_ROW if there are many columns.
> Since the number of columns/versions is not known at scan time (and can in
> fact vary between rows) it is hard to always do the right thing. It also
> depends on how large the KVs are on average. So replacing INCLUDE_NEXT_XXX
> with INCLUDE is not always the right idea.
> Thinking aloud... We could take the VERSIONS setting of the column family
> into account as a guideline for the expected number of versions (but
> there's no guarantee about how many versions we'll actually have until we've
> had a compaction), and replace INCLUDE_NEXT_COL with INCLUDE if VERSIONS is
> small (maybe < 10 or so). Maybe that'd be worth a jira...
> There are some fixes in 0.94.12 (HBASE-8930, avoid a superfluous reseek in
> some cases), and HBASE-9732 might help in 0.94.13 (avoid memory fences on
> a volatile on each seek/reseek).
> It also would be nice to figure out why reseek is so much more expensive.
> If the KV we reseek to is on the same block it should just scan forward,
> otherwise it'll look in the appropriate block. It probably is the creation
> of the fake KV we want to seek to (like firstOnRow, lastOnRow, etc), in which
> case there's not much we can do.
> Lastly, I've not spent much time profiling the ExplicitColumnMatcher, yet,
> looks like I should start doing that.
> So in your case everything is in the blockcache, no data in the memstore?
> -- Lars
> From: Vladimir Rodionov <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Monday, October 14, 2013 2:49 PM
> Subject: Re: Scanner with explicit columns list is very slow
> One fast optimization:
> There is no need to call reseek on INCLUDE_NEXT_COL - this is going to be
> the same row in the same KeyValueScanner (currently on top of
> On Mon, Oct 14, 2013 at 2:46 PM, Vladimir Rodionov
> <[EMAIL PROTECTED]>wrote:
> > I profiled the last test case (5 columns total and 2 in a scan).
> > 80% of StoreScanner.next() execution time is spent in:
> > StoreScanner.reseek() - 71%
> > ScanQueryMatcher.getKeyForNextColumn() - 6%
> > ScanQueryMatcher.getKeyForNextRow() - 2%
> > Should I open JIRA?
> > On Mon, Oct 14, 2013 at 2:03 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
> >> I modified the tests:
> >> Now I created a table with one CF and 5 columns: CQ1,..,CQ5
> >> 1. Scan.addColumn(CF, CQ1);
> >> Scan.addColumn(CF, CQ3);
> >> 2. Scan.addFamily(CF);
> >> Scan performance from block cache:
> >> 1. 400K rows per sec
> >> 2. 1.6M rows per sec