Re: prefix compression
Thanks for the feedback, Stack.  Some inline responses:

On Thu, Jun 2, 2011 at 9:48 PM, Stack <[EMAIL PROTECTED]> wrote:

> High-level this sounds great.
>
> Inline below is some feedback and a bit of history on how we got here
> in case it helps:
>
> On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> > * refer to prefix compression as "compaction" to avoid interfering with
> > traditional "compression"
> >
>
> 'Compaction' is already a term in hbase.  Your suggestion below may
> reproduce an issue we currently have where the term 'split' can refer
> to the splitting of logs on regionserver recovery or the split of a region
> because it's grown too large.  Just a heads-up.
>

Ha, yes, not sure how I overlooked that.  Then maybe name the config
params: rowKeyFormat, colQualifierFormat, and timestampFormat.

>
> > * main goal is to reduce size of data in memory to increase effective
> > capacity of block index, block cache, and memstores
>
>
> Ok.  At some cost in CPU.
>
>
> >    * disk access takes thousands of times longer than memory access,
> > especially over hdfs/network
> >
>
> Yes.  I would underline the clause that starts 'especially' in the above.
>
>
> > * secondary goal is to speed up the processing of in memory data
> >    * memory access takes 200+ cpu cycles, so must use cache friendly
> > algorithms
> >    * structure trie node sizes around typical 64B cache line size
>
> How would you measure our effectiveness here?
>

The most important thing is to make the format pluggable; then we can run
HFile performance tests to find the optimal tuning for different
hardware/workloads.
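
For what it's worth, here's a rough sketch of what a pluggable block-encoding
hook could look like; the interface name and methods are invented for
illustration and are not existing HBase APIs:

import java.nio.ByteBuffer;

// Hypothetical sketch of a pluggable block-encoding hook.  The name and
// method signatures are made up for illustration, not actual HBase APIs.
public interface KeyValueBlockEncoder {

    /** Encodes a block of serialized KeyValues into this format's representation. */
    ByteBuffer encode(ByteBuffer rawKeyValues);

    /** Decodes back to the standard KeyValue layout for readers that expect it. */
    ByteBuffer decode(ByteBuffer encodedBlock);

    /** Identifier written with the block so readers know which decoder to use. */
    String formatName();
}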

The idea with the 64B is that if you're going to go to memory and get
64B (or 8 words) back, you might as well have designed your data structure
so that you can make use of all 64B.  This requires that you structure the
prefix tries a little differently than older literature recommends.
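
To make the 64B idea concrete, here's a rough Java sketch of a trie node
packed into a single cache line; the class name and field layout are
hypothetical, just to illustrate the sizing argument:

// Hypothetical sketch: a prefix-trie node packed into one 64-byte cache line
// so a single memory fetch pulls in the whole node.  Layout is illustrative
// only, not an actual HBase format.
public final class CacheLineTrieNode {

    public static final int NODE_SIZE = 64;   // one cache line per node

    // Example layout within the 64-byte slab:
    //   bytes  0..1   fanout (number of child edges)
    //   bytes  2..3   length of the shared key prefix
    //   bytes  4..19  up to 16 prefix bytes stored inline
    //   bytes 20..63  packed child offsets (e.g. short or vint offsets)
    private final byte[] block;   // backing array holding many nodes
    private final int offset;     // start of this node within the array

    public CacheLineTrieNode(byte[] block, int offset) {
        this.block = block;
        this.offset = offset;
    }

    public int fanout() {
        return ((block[offset] & 0xFF) << 8) | (block[offset + 1] & 0xFF);
    }

    public int prefixLength() {
        return ((block[offset + 2] & 0xFF) << 8) | (block[offset + 3] & 0xFF);
    }
}
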
>
>
> >    * investigate java's memory prefetching capabilities, possibly
> > increasing to ~256B nodes
>
> Do you have a pointer on this phenomenon (I'm ignorant and would learn
> more)?
>
>
Paying attention to pre-fetching would be a super-optimization, but it is a
very real thing that can buy you another ~20% performance.  64B is the
normal memory fetch size right now, but sometimes the hardware will bring
back even larger chunks.  Some languages offer specific commands to
prefetch... java doesn't, but the compiler is adding them behind the scenes
so it could possibly be coerced.  The best reading on the topic is probably
the paper I mentioned earlier
(link<http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg>),
and then its references if you're still interested.
>
> >    * almost always use variable length integer/long encoding
> >
>
> In my limited experience I've found these to be CPU intensive, making
> the parse of a compound byte array of rowkey+family+etc. (such as KV
> is now) complicated.
>
> On other hand, our Benoit reports having made significant improvements
> over the hadoop varlength I was using.
>

Pluggable formats would help here so you could tune for mem vs cpu.  My
thought is that most hbase vints would be very short (1-2 byte internal
array offsets) which are faster to decode, but more importantly, save a lot
of memory.  This topic is covered a lot in discussions of inverted index
posting list compression.  Not perfectly relevant, but here's a good
paper <http://www2008.org/papers/pdf/p387-zhangA.pdf> about compressing
integers/longs.  I have no data to back it up, but I feel
like if search engines are primarily storing lists of longs and they decide
it's best to compress them, then that's probably the best method for other
pieces of software as well.
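
To make the 1-2 byte case concrete, here's a minimal sketch of a generic
LEB128-style varint encoder/decoder; this is not the Hadoop WritableUtils
format, just the general technique:

import java.io.ByteArrayOutputStream;

// Minimal generic varint sketch: 7 data bits per byte, high bit set when
// more bytes follow.  Values under 128 take one byte, under 16384 two bytes,
// which is why short internal array offsets compress so well.
public final class VarInt {

    public static void write(int value, ByteArrayOutputStream out) {
        while ((value & ~0x7F) != 0) {          // more than 7 bits remain
            out.write((value & 0x7F) | 0x80);   // low 7 bits plus continuation bit
            value >>>= 7;
        }
        out.write(value);                       // final byte, high bit clear
    }

    public static int read(byte[] buf, int pos) {
        int value = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos++] & 0xFF;
            value |= (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return value;
    }
}
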
> > * need separate compacted structures/algorithms for rowKey, colQualifier,
> > and timestamp

All three parts of the KV could share the same trie algorithm for
compressing arbitrary byte arrays.  However, I think sometimes you can
compress colQualifiers even further (and faster) with a lookup table.  And
timestamps even further with a delta scheme or by completely omitting them.
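
A toy sketch of those two ideas (qualifier lookup table, timestamp deltas);
the class and method names are made up for illustration, not actual HBase
code:

import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: deduplicate column qualifiers through a per-block
// lookup table and store timestamps as deltas from the block minimum.
public final class BlockEncodingSketch {

    private final List<byte[]> qualifierDictionary = new ArrayList<byte[]>();
    private final Map<String, Integer> qualifierIndex = new HashMap<String, Integer>();

    /** Returns a small integer id instead of repeating the qualifier bytes per cell. */
    public int internQualifier(byte[] qualifier) {
        String key = new String(qualifier, StandardCharsets.ISO_8859_1);
        Integer id = qualifierIndex.get(key);
        if (id == null) {
            id = qualifierDictionary.size();
            qualifierDictionary.add(qualifier);
            qualifierIndex.put(key, id);
        }
        return id;
    }

    /** Rewrites timestamps as deltas from the smallest one in the block. */
    public long[] deltaEncodeTimestamps(long[] timestamps) {
        long base = Long.MAX_VALUE;
        for (long ts : timestamps) {
            base = Math.min(base, ts);
        }
        long[] deltas = new long[timestamps.length];
        for (int i = 0; i < timestamps.length; i++) {
            deltas[i] = timestamps[i] - base;   // small deltas then vint-encode well
        }
        return deltas;
    }
}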

Agreed - I'm working on some simple code examples in my free time.
Hopefully I can post them in a couple weeks.

data block.  If you take a data block and split it into 3 parallel lists of
equal length: a list of rowKeys, a list of colQualifiers, and a list of
timestamps, the rowKeys will automatically be sorted, but the other two
will not be (unless the whole block is contained in one row).  Does that
make sense?
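
Roughly something like this (hypothetical names, just to show the three
parallel lists):

import java.util.ArrayList;
import java.util.List;

// Toy sketch: split a block's cells into three parallel lists.  Because the
// cells arrive sorted by rowKey first, the rowKeys list comes out sorted;
// the qualifier and timestamp lists generally do not.
public final class ParallelLists {

    public final List<byte[]> rowKeys = new ArrayList<byte[]>();
    public final List<byte[]> qualifiers = new ArrayList<byte[]>();
    public final List<Long> timestamps = new ArrayList<Long>();

    /** Append one cell's parts; call in the block's cell order. */
    public void add(byte[] rowKey, byte[] qualifier, long timestamp) {
        rowKeys.add(rowKey);
        qualifiers.add(qualifier);
        timestamps.add(timestamp);
    }
}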

For memstore and block index, you take the common prefix of the first and
last rowKey.  For data block, take the common prefix of first and last key.
 As tables get really large, they are more and more likely to have longer
common prefixes.
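
A minimal sketch of the common-prefix computation (illustrative, not the
actual HBase implementation):

// Since keys are sorted, any prefix shared by the first and last key is
// shared by every key in between, so it only needs to be stored once.
public final class CommonPrefix {

    public static int commonPrefixLength(byte[] first, byte[] last) {
        int limit = Math.min(first.length, last.length);
        int i = 0;
        while (i < limit && first[i] == last[i]) {
            i++;
        }
        return i;
    }
}
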
I think it's the hardware that determines the cache line size and does some
of the prefetching, so java is directly affected by those.  Unfortunately,
I'm not sure it's possible to align the start of an array on a cache line in
java like it is in C/C++; however, on average you can make use of half the
cache line... and there's always hope for the future.

Not sure it's always best to do this, but the idea is to not pollute the
cache with value data unless it's actually needed.  Think of the case of a
scanner with a high block cache hit ratio but returning sparse results.
This may be an improvement even in the case where prefix compression isn't
needed to begin with.  Another case for pluggable formats and testing...

Agree about backwards compatibility being a bad idea (especially in a
pre-1.0 product).  I guess I meant "current compatibility", like it could be
included in 0.90.4 since it doesn't require changes in serialization or file
formats.  Not pushing for that, though.