Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # dev - prefix compression


Copy link to this message
-
prefix compression
Matt Corgan 2011-06-02, 22:28
I've been reading up on prefix compression for a few weeks trying to figure
out the best approach.  Maybe some of you have noticed that 90% of my emails
to the user list have something to do with it...  Now that people are
working on HFile v2 and the topic of Lucene's FST has come up, I thought it
would be a good time to share some thoughts.  My goal here is definitely not
asking anyone to implement this, but just to start a discussion that gets
the proper jiras in place.  I have spent a little time coding an initial
rowKey trie builder that could be the start of a pluggable implementation.

This is a big topic to address in one email, but I don't think the pieces
make sense on their own.  Here's an outline with some thoughts:

* refer to prefix compression as "compaction" to avoid interfering with
traditional "compression"

* main goal is to reduce size of data in memory to increase effective
capacity of block index, block cache, and memstores
    * disk access takes thousands of times longer than memory access,
especially over hdfs/network

* secondary goal is to speed up the processing of in memory data
    * memory access takes 200+ cpu cycles, so must use cache friendly
algorithms
    * structure trie node sizes around typical 64B cache line size
    * investigate java's memory prefetching capabilities, possibly
increasing to ~256B nodes
        * research
paper<http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg>about
modern memory databases and prefetching
    * almost always use variable length integer/long encoding

* need separate compacted structures/algorithms for rowKey, colQualifier,
and timestamp
    * structures will need to interact with varInt pointers to link
rowKey<->colQualifier<->timestamp

* i think in the medium to long term (and maybe even short term) it would be
best to code this from scratch
Static row key compaction for block cache and data blocks:
* avoid schemes where each row is a diff of the previous (expensive cpu)
* utilize the fact that row keys are always sorted (unlike column
qualifiers)
    * this makes for fast tree construction during flushing and compaction
* low hanging fruit: (rowKeyCompaction=simple)
    * pull off the common prefix of all rows
    * have a flag indicating if row key is repeat of the previous
* more aggressive: use modified Cache Sensitive Search
Trees<http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf>
(rowKeyCompaction=cssTree64,
cssTree256, etc)
    * modified to accept full byte range 0-255, store full byte strings in
the tree, remove all pointers, optimize for cache line size, etc, but
similar idea
    * CSS trees are considered sub-par because they are not easily modified,
but that is ok in hbase's block cache and data blocks
* input is a sorted list of byte[], output is a byte[] containing the entire
prefix trie
* possibly put all keys in the front of the block, and all values at the
end, with value offsets/lengths in the keys

Backwards compatibility
* block cache could quickly compute the trie when loaded, therefore working
with HFile v1
    * eventually want to store the computed trie on disk
* HFile v2 will need to store information about the pluggable data block
compaction schemes
Memstore dynamic row key compaction:
* this will be much more difficult than the static compaction because it
needs to accept unsorted byte[]'s and support row locking
* use modified C-Burstsort<http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf>
for
dynamic unsorted data
* fast and compact (best dynamic string sorting algorithm i could find)
* fragmentation friendly (not original goal, but ends up working like MSLAB)
    * allocates large (customizable) byte[] for holding multiple leaf values
    * "bursts" the byte[] when they fill up, creating multiple child arrays
* will need to add some sort of lockable object to the leaf of the rowKey
trie
Column qualifier compaction:
* config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
* use lookupTable when column qualifiers are a small bounded set
    * store the lookup table in memory (similar to blockIndex or
fixedFileTrailer)
    * on second thought, this could be difficult without 2-pass compaction,
but maybe still possible
* use cssTree when a large or unbounded set, like when appending a numerical
suffix in wide rows
Timestamp compaction:
* timestampCompaction=(none, hfileDiff, cssTree, flatten)
* each hfile needs a lowestTimestamp stored in the trailer
* hflieDiff would store a varLong relative to the lowestTimestamp
* cssTree would do the same as it does for rowKeys, but the trie overhead
might wipe out the space savings with such small values
* flatten would not store a timestamp in each record, rather use the hfile's
lowestTimestamp for every record
    * many use cases don't care about the timestamp
    * i think this works for maxVersions=1.  may need more elaborate scheme
for multiple versions
I'll stop there...  sorry for the huge email!  I haven't seen much
discussion on email or jira about how prefix compression should be done, so
I hope this provides somewhat of a starting point.

Matt