HBase >> mail # dev >> Scanner with explicit columns list is very slow

Re: Scanner with explicit columns list is very slow
Yes, I load the data into HRegion (with CACHE_ON_WRITE) and then call
flushcache(), so there is no data in the memstore.

This is what I found: the default implementation of ExplicitColumnMatcher
appears to be tuned for very large rows. We need a scan hint that tells
StoreScanner which strategy to use:

1. For very large rows: ExplicitColumnMatcher with reseeks (what we have
currently).
2. For small/medium rows: remove the explicit columns/families from the Scan
and replace them with an additional filter that keeps the Scan's
columnFamilyMap and verifies that every KV matches this map.
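
A minimal sketch of the per-KV check behind option 2, written in plain Java without the HBase classes so it runs on its own. The names ExplicitScanReplacementSketch, Cell, addColumn, include and scan are hypothetical stand-ins for illustration; this is not the actual ExplicitScanReplacementFilter code, only the idea of testing every KV against the family map instead of reseeking.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableSet;
import java.util.TreeMap;
import java.util.TreeSet;

class ExplicitScanReplacementSketch {

    // Hypothetical stand-in for a KeyValue: only the fields the check needs.
    static final class Cell {
        final String row;
        final String family;
        final String qualifier;

        Cell(String row, String family, String qualifier) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
        }
    }

    // Plays the role of the Scan's family map: family -> requested qualifiers.
    private final Map<String, NavigableSet<String>> familyMap = new TreeMap<>();

    void addColumn(String family, String qualifier) {
        familyMap.computeIfAbsent(family, f -> new TreeSet<>()).add(qualifier);
    }

    // Per-KV check: a map lookup instead of a reseek.
    boolean include(Cell cell) {
        NavigableSet<String> qualifiers = familyMap.get(cell.family);
        return qualifiers != null && qualifiers.contains(cell.qualifier);
    }

    // Walks the cells in order and keeps those that match the map.
    List<Cell> scan(List<Cell> sortedCells) {
        List<Cell> out = new ArrayList<>();
        for (Cell c : sortedCells) {
            if (include(c)) {
                out.add(c);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        ExplicitScanReplacementSketch filter = new ExplicitScanReplacementSketch();
        filter.addColumn("cf", "cq1");
        filter.addColumn("cf", "cq3");

        // Two small rows with 5 qualifiers each, as in the test setup.
        List<Cell> cells = new ArrayList<>();
        for (int r = 0; r < 2; r++) {
            for (int q = 1; q <= 5; q++) {
                cells.add(new Cell("row" + r, "cf", "cq" + q));
            }
        }
        System.out.println("kept " + filter.scan(cells).size() + " of " + cells.size() + " cells");
    }
}
```

The trade-off is the one discussed in this thread: every KV is visited, which is cheap for small rows but would lose to reseeks once rows get very large.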

I have created such a filter (ExplicitScanReplacementFilter) and verified
that it works much better than option 1 for small rows. With 1 CF, 5 CQs, and
a Scan over 2 CQs I get:

400K rows per sec with default
1.25M with ExplicitScanReplacementFilter

I will optimize ExplicitScanReplacementFilter further and will probably
reach 1.4-1.5M rows per sec tomorrow.
We need a JIRA; I will open one tomorrow.

-Vladimir
On Mon, Oct 14, 2013 at 9:38 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Interesting. Thanks for doing the testing/profiling Vladimir!
>
>
> Generally reseeks are better if they can skip many KVs.
>
> For example if you have many versions of the same row/col,
> INCLUDE_NEXT_COL will be better than issuing many INCLUDEs, same with
> INCLUDE_NEXT_ROW if there are many columns.
>
> Since the number of columns/versions is not known at scan time (and can in
> fact vary between rows) it is hard to always do the right thing. It also
> depends on how large the KVs are on average. So replacing INCLUDE_NEXT_XXX
> with INCLUDE is not always the right idea.
>
>
> Thinking aloud... We could take the VERSIONS setting of the column family
> into account as a guideline for the expected number of versions (but
> there's no guarantee about how many versions we'll actually have until we
> have had a compaction), and replace INCLUDE_NEXT_COL with INCLUDE if VERSIONS is
> small (maybe < 10 or so). Maybe that'd be worth a jira...
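
A minimal sketch of the VERSIONS-based guideline described in the quoted paragraph above, in plain Java. The enum MatchCode is a local stand-in that reuses the names from this thread, not HBase's own type, and the threshold of 10 is only the "maybe < 10 or so" guess from the quote, not a measured value. MatchCodeHeuristic and chooseCode are hypothetical names.

```java
class MatchCodeHeuristic {

    // Local stand-in using the names from this thread; not the HBase enum.
    enum MatchCode { INCLUDE, INCLUDE_NEXT_COL }

    // Assumed cutoff taken from the "maybe < 10 or so" remark; untuned.
    static final int SMALL_VERSIONS_THRESHOLD = 10;

    // With few expected versions, scanning forward over them is assumed to be
    // cheaper than a reseek, so return INCLUDE; otherwise skip to the next column.
    static MatchCode chooseCode(int familyMaxVersions) {
        return familyMaxVersions < SMALL_VERSIONS_THRESHOLD
                ? MatchCode.INCLUDE
                : MatchCode.INCLUDE_NEXT_COL;
    }

    public static void main(String[] args) {
        System.out.println("VERSIONS=3    -> " + chooseCode(3));
        System.out.println("VERSIONS=1000 -> " + chooseCode(1000));
    }
}
```

As the quote notes, VERSIONS is only a guideline: the actual number of versions on disk can exceed it until a compaction has run.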
>
>
> There are some fixes in 0.94.12 (HBASE-8930, avoid a superfluous reseek in
> some cases), and HBASE-9732 might help in 0.94.13 (avoid memory fences on
> a volatile on each seek/reseek).
>
> It also would be nice to figure out why reseek is so much more expensive.
> If the KV we reseek to is on the same block it should just scan forward,
> otherwise it'll look in the appropriate block. It probably is the creation
> of the fake KV we want to seek to (like firstOnRow, lastOnRow, etc), in which
> case there's not much we can do.
>
>
> Lastly, I've not spent much time profiling the ExplicitColumnMatcher yet;
> looks like I should start doing that.
>
>
> So in your case everything is in the blockcache, no data in the memstore?
>
> -- Lars
>
>
>
> ________________________________
>  From: Vladimir Rodionov <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Monday, October 14, 2013 2:49 PM
> Subject: Re: Scanner with explicit columns list is very slow
>
>
> One fast optimization:
>
> There is no need to call reseek on INCLUDE_NEXT_COL - this is going to be
> the same row in the same KeyValueScanner (currently on top of
> KeyValueHeap).
>
>
>
>
>
> On Mon, Oct 14, 2013 at 2:46 PM, Vladimir Rodionov
> <[EMAIL PROTECTED]> wrote:
>
> > I profiled the last test case (5 columns total and 2 in a scan).
> >
> > 80% of StoreScanner.next() execution time is spent in:
> >
> > StoreScanner.reseek() - 71%
> > ScanQueryMatcher.getKeyForNextColumn() - 6%
> > ScanQueryMatcher.getKeyForNextRow() - 2%
> >
> > Should I open JIRA?
> >
> >
> > On Mon, Oct 14, 2013 at 2:03 PM, Vladimir Rodionov <[EMAIL PROTECTED]> wrote:
> >
> >> I modified the tests:
> >>
> >> I created a table with one CF and 5 columns: CQ1, ..., CQ5
> >>
> >> 1. Scan.addColumn(CF, CQ1);
> >>     Scan.addColumn(CF, CQ3);
> >>
> >> 2. Scan.addFamily(CF);
> >>
> >> Scan performance from block cache:
> >>
> >> 1.  400K rows per sec
> >> 2.  1.6M rows per sec