HBase, mail # dev - Beware of PREFIX_TREE block encoding


Re: Beware of PREFIX_TREE block encoding
Matt Corgan 2013-10-21, 22:01
Prefix-tree benefits when you have data with long keys and short values and
you're doing many small gets or multi-gets.  Like Lars said, it can also
expand the effective size of your block cache.
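
In case it helps anyone trying this out: data block encoding is chosen per
column family. Below is a minimal sketch with the 0.96-era Java API (the
table and family names are just placeholders; you could equally do it from
the shell with alter 't1', {NAME => 'cf', DATA_BLOCK_ENCODING => 'FAST_DIFF'}):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.encoding.DataBlockEncoding;

public class EncodingSetup {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);

    // "t1" and "cf" are placeholder names.
    HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("t1"));
    HColumnDescriptor family = new HColumnDescriptor("cf");
    // Pick the encoding under test: NONE, FAST_DIFF or PREFIX_TREE.
    family.setDataBlockEncoding(DataBlockEncoding.FAST_DIFF);
    desc.addFamily(family);

    admin.createTable(desc);
    admin.close();
  }
}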

To speed it up we'll need to have the scanners and KeyValueHeap operate
directly on Cells rather than copying each Cell to a KeyValue first.  I
haven't had time to work on it this year:
https://issues.apache.org/jira/browse/HBASE-7319
https://issues.apache.org/jira/browse/HBASE-7323
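
To make the copy concrete, here is a rough illustrative sketch (not the
actual scanner code; ensureKeyValue is the kind of flattening step meant
above):

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.util.Bytes;

public class CellVsKeyValue {
  // Today's read path: a Cell produced by the prefix-tree decoder gets
  // flattened into a contiguous KeyValue (an allocation plus array copy
  // for every Cell that flows through the scanner).
  static KeyValue flatten(Cell c) {
    return KeyValueUtil.ensureKeyValue(c);
  }

  // What HBASE-7319/7323 aim for: consume the Cell in place through its
  // (array, offset, length) accessors, with no intermediate KeyValue.
  static int compareRows(Cell a, Cell b) {
    return Bytes.compareTo(
        a.getRowArray(), a.getRowOffset(), a.getRowLength(),
        b.getRowArray(), b.getRowOffset(), b.getRowLength());
  }
}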

On Sun, Oct 20, 2013 at 10:06 AM, Vladimir Rodionov
<[EMAIL PROTECTED]> wrote:

> FAST_DIFF:
> Time to read all 1.3M rows reported in ms.
>
> encoding = NONE,         scanner = StoreScanner;     time = 300 ms
> encoding = PREFIX_TREE,  scanner = StoreScanner;     time = 860 ms
> encoding = FAST_DIFF,    scanner = StoreScanner;     time = 460 ms
> encoding = NONE,         scanner = StoreFileScanner; time = 52 ms
> encoding = PREFIX_TREE,  scanner = StoreFileScanner; time = 545 ms
> encoding = FAST_DIFF,    scanner = StoreFileScanner; time = 195 ms
>
> -Vladimir
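
For anyone who wants a rough client-side approximation of this kind of
timing: the sketch below is not the StoreScanner/StoreFileScanner harness
used above, so RPC overhead is included; "t1" is a placeholder table name,
and the scan should be run once first so the data is hot in the block cache.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;

public class ScanTiming {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");   // "t1" is a placeholder

    Scan scan = new Scan();
    scan.setCaching(10000);      // large batches so RPC overhead doesn't dominate
    scan.setCacheBlocks(true);   // keep the data blocks in the block cache

    long start = System.currentTimeMillis();
    long rows = 0;
    ResultScanner scanner = table.getScanner(scan);
    for (Result r : scanner) {
      rows++;
    }
    scanner.close();
    table.close();

    System.out.println(rows + " rows in "
        + (System.currentTimeMillis() - start) + " ms");
  }
}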
>
>
>
> On Sun, Oct 20, 2013 at 4:06 AM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>
> > Vladimir, any chance to run the same test with FAST_DIFF?
> >
> > J
> >
> >
> > 2013/10/20 Vladimir Rodionov <[EMAIL PROTECTED]>
> >
> > > I wanted to try PREFIX_TREE because it is supposed to be fastest on
> > > seek/reseek.
> > >
> > >
> > > On Sat, Oct 19, 2013 at 9:12 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> > >
> > > > I found FAST_DIFF to be the fastest of the block encoders.
> > > > (Prefix tree is in 0.96+ only as far as I know.)
> > > >
> > > > -- Lars
> > > >
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Vladimir Rodionov <[EMAIL PROTECTED]>
> > > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; lars hofhansl <
> > > > [EMAIL PROTECTED]>
> > > > Cc:
> > > > Sent: Saturday, October 19, 2013 9:08 PM
> > > > Subject: Re: Beware of PREFIX_TREE block encoding
> > > >
> > > > *Now, which encoder did you test specifically? I've seen a 20-40%
> > > > slowdown when everything is in the blockcache (which is the worst case
> > > > scenario here), certainly not a 10x slowdown.*
> > > >
> > > > I have 1.3M rows (very small - 48 bytes) in a block cache which I read
> > > > sequentially, using encoding NONE, PREFIX_TREE and
> > > > StoreScanner/StoreFileScanner (close to metal - block cache :)
> > > >
> > > > Time to read all 1.3M rows reported in ms.
> > > >
> > > > encoding = NONE,         scanner = StoreScanner;     time = 300 ms
> > > > encoding = PREFIX_TREE,  scanner = StoreScanner;     time = 860 ms
> > > > encoding = NONE,         scanner = StoreFileScanner; time = 52 ms
> > > > encoding = PREFIX_TREE,  scanner = StoreFileScanner; time = 545 ms
> > > >
> > > > -Vladimir
> > > >
> > > >
> > > >
> > > >
> > > > On Sat, Oct 19, 2013 at 8:50 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > That is (unfortunately) a known issue. The main problem is that HBase
> > > > > expects each KV to be backed by a contiguous byte[]. For any prefix
> > > > > encoding it is thus necessary to rematerialize the KV (i.e. copy all
> > > > > the partial bytes into a new location).
> > > > > That is inefficient. Nobody has taken this on yet (we're 1/2 there
> > > > > with Cells in 0.96, though).
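
To make the contiguous-byte[] point concrete, here is a small sketch
assuming the 0.96 KeyValue API: every Cell accessor on a KeyValue resolves
into one backing array, which is why a prefix-decoded cell has to be copied
before it can be handed out as a KeyValue.

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class ContiguousKv {
  public static void main(String[] args) {
    // Building a KeyValue copies row, family, qualifier and value into
    // one flat byte[].
    KeyValue kv = new KeyValue(
        Bytes.toBytes("row1"),
        Bytes.toBytes("cf"),
        Bytes.toBytes("q"),
        Bytes.toBytes("value"));

    // All Cell accessors point at that same backing array; only the
    // offsets differ.
    System.out.println(kv.getRowArray() == kv.getValueArray());   // prints true
  }
}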
> > > > >
> > > > > There are jiras out there to fix this, like HBASE-7320 and more
> > > > > recently HBASE-9794.
> > > > >
> > > > > Now, which encoder did you test specifically? I've seen a 20-40%
> > > > > slowdown when everything is in the blockcache (which is the worst case
> > > > > scenario here), certainly not a 10x slowdown.
> > > > >
> > > > > Note that with block encoding the blocks are stored encoded in the
> > > > > blockcache, so more data fits into the cache, and (obviously) there's
> > > > > less IO when the data is not in the cache. So the extra work CPU cycles