Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # dev >> prefix compression implementation

Copy link to this message
Re: prefix compression implementation

Thanks for helping out with this.  I implemented most of the DeltaEncoder
and DeltaEncoderSeeker.  I haven't taken the time to generate a good set of
test data for any of this, but it does pass on some very small input data
that aims to cover the edge cases i can think of.  Perhaps you have full
HFiles you can run through it.


I also put some notes on the PtDeltaEncoder regarding how the prefix trie
should be optimally used.  I can't think of a situation where you'd want to
blow it up into the full uncompressed KeyValue ByteBuffer, so implementing
the DeltaEncoder interface is a mismatch, but I realize it's only a starting

You also would never really have a full ByteBuffer of KeyValues to pass to
it for compression.  Typically, you'd be passing individual KeyValues from
the memstore flush or from a collection of HFiles being merged through a

The end goal is to operate on the encoded trie without decompressing it.
 Long term, and in certain circumstances, it may even be possible to pass
the compressed trie over the wire to the client who can then decode it.

Let me know if I implemented that the way you had in mind.  I haven't done
the seekTo method yet, but will try to do that next week.


On Wed, Sep 14, 2011 at 3:43 PM, Jacek Migdal <[EMAIL PROTECTED]> wrote:

> Matt,
> Thanks a lot for the code. Great job!
> As I mentioned in JIRA I work full time on the delta encoding [1]. Right
> now the code and integration is almost done. Most of the parts are under
> review. Since it is a big change will plan to test it very carefully.
> After that, It will be ported to trunk and open sourced.
> I have a quick glimpse I have taken the different approach. I implemented
> a few different algorithms which are simpler. They also aims mostly to
> save space while having fast decompress/compress code. However the access
> is still sequential. The goal of my project is to save some RAM by having
> compressed BlockCache in memory.
> On the other hand, it seems that you are most concerned about seeking
> performance.
> I will read your code more carefully. A quick glimpse: we both implemented
> some routines (like vint), but expect that there is no overlap.
> I also seen that you spend some time investigating ByteBuffer vs. Byte[].
> I experienced significant negative performance impact when I switched to
> ByteBuffer. However I postpone this optimization.
> Right now I think the easiest way to go would be that you will implement
> DeltaEncoder interface after my change:
> http://pastebin.com/Y8UxUByG
> (note, there might be some minor changes)
> That way, you will reuse my integration with existing code for "free".
> Jacek
> [1] - I prefer to call it that way. Prefix is one of the algorithm, but
> there are also different approach.
> On 9/13/11 1:36 AM, "Ted Yu" <[EMAIL PROTECTED]> wrote:
> >Matt:
> >Thanks for the update.
> >Cacheable interface is defined in:
> >src/main/java/org/apache/hadoop/hbase/io/hfile/Cacheable.java
> >
> >You can find the implementation at:
> >src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
> >
> >I will browse your code later.
> >
> >On Tue, Sep 13, 2011 at 12:44 AM, Matt Corgan <[EMAIL PROTECTED]>
> wrote:
> >
> >> Hi devs,
> >>
> >> I put a "developer preview" of a prefix compression algorithm on github.
> >>  It
> >> still needs some details worked out, a full set of iterators, about 200
> >> optimizations, and a bunch of other stuff...  but, it successfully
> >>passes
> >> some preliminary tests so I thought I'd get it in front of more eyeballs
> >> sooner than later.
> >>
> >> https://github.com/hotpads/hbase-prefix-trie
> >>
> >> It depends on HBase's Bytes.java and KeyValue.java, which depends on
> >> hadoop.
> >>  Those jars are in there, along with some others for full HFile testing
> >>in
> >> the near future.
> >>
> >> A good place to start looking at the code