HBase >> mail # dev >> prefix compression implementation


Re: prefix compression implementation
Ryan - I answered your question on another thread yesterday.  Will use this
thread to continue the conversation on the KeyValue interface.

I don't think the name is all that important, though I thought HCell was
less clumsy than KeyValue or KeyValueInterface.  Take a look at this
interface on github:

https://github.com/hotpads/hbase-prefix-trie/blob/master/src/org/apache/hadoop/hbase/model/HCell.java

It seems like it should be trivially easy to get KeyValue to implement that.
 It then provides the right methods for writing compareTo methods that work
across different implementations.  The implementations of those methods
might have an if-statement to determine the class of the "other" HCell and
choose the fastest byte comparison method behind the scenes.
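As a rough illustration (the real interface is in the github link above; the
names and signatures here are simplified stand-ins, not the actual HCell
methods), a cross-implementation comparator might look something like this:

```java
import java.util.Comparator;

// Hypothetical stand-in for the HCell interface described above.
interface HCell {
    byte[] getRow();
    byte[] getFamily();
    byte[] getQualifier();
    long getTimestamp();
}

// One concrete implementation backed by plain arrays.
class ArrayBackedCell implements HCell {
    private final byte[] row, family, qualifier;
    private final long timestamp;

    ArrayBackedCell(byte[] row, byte[] family, byte[] qualifier, long ts) {
        this.row = row; this.family = family;
        this.qualifier = qualifier; this.timestamp = ts;
    }
    public byte[] getRow() { return row; }
    public byte[] getFamily() { return family; }
    public byte[] getQualifier() { return qualifier; }
    public long getTimestamp() { return timestamp; }
}

// Works across any HCell implementations; a real version could branch on
// the concrete class of "right" to pick a faster byte-comparison path,
// as suggested in the mail above.
class HCellComparator implements Comparator<HCell> {
    private static int compareBytes(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);  // unsigned byte order
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    @Override
    public int compare(HCell left, HCell right) {
        // e.g. if (right instanceof ArrayBackedCell) { ...fast path... }
        int d = compareBytes(left.getRow(), right.getRow());
        if (d != 0) return d;
        d = compareBytes(left.getFamily(), right.getFamily());
        if (d != 0) return d;
        d = compareBytes(left.getQualifier(), right.getQualifier());
        if (d != 0) return d;
        // newer timestamps sort first, matching KeyValue ordering
        return Long.compare(right.getTimestamp(), left.getTimestamp());
    }
}
```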

I need to look into the KeyValue scanner interfaces.
On Fri, Sep 16, 2011 at 7:34 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:

> On Fri, Sep 16, 2011 at 7:29 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> > Ryan - thanks for the feedback.  The situation I'm thinking of where
> > it's useful to parse a DirectBB without copying to heap is when you are
> > serving small random values out of the block cache.  At HotPads, we'd
> > like to store hundreds of GB of real estate listing data in memory so
> > it can be quickly served up at random.  We want to access many small
> > values that are already in memory, so we're basically skipping step 1
> > of 3 because the values are already in memory.  That being said, the
> > DirectBBs are not essential for us since we haven't run into GC
> > problems; I just figured it would be nice to support them since they
> > seem to be important to other people.
> >
> > My motivation for doing this is to make HBase a viable candidate for a
> > large, auto-partitioned, sorted, *in-memory* database.  Not the usual
> > analytics use case, but I think HBase would be great for this.
>
> What exactly about the current system makes it not a viable candidate?
>
> >
> > On Fri, Sep 16, 2011 at 7:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
> >
> >> On Fri, Sep 16, 2011 at 6:47 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> >> > I'm a little confused over the direction of the DBBs in general,
> >> > hence the lack of clarity in my code.
> >> >
> >> > I see value in doing fine-grained parsing of the DBB if you're going
> >> > to have a large block of data and only want to retrieve a small KV
> >> > from the middle of it.  With this trie design, you can navigate your
> >> > way through the DBB while copying hardly anything to the heap.  It
> >> > would be a shame to blow away your entire L1 cache by loading a
> >> > whole 256KB block onto the heap if you only want to read 200 bytes
> >> > out of the middle... it can be done ultra-efficiently.
> >>
> >> This paragraph is not factually correct.  The DirectByteBuffer vs main
> >> heap has nothing to do with the CPU cache.  Consider the following
> >> scenario:
> >>
> >> - read block from DFS
> >> - scan block in ram
> >> - prepare result set for client
> >>
> >> Pretty simple, we have a choice in step 1:
> >> - write to java heap
> >> - write to DirectByteBuffer off-heap controlled memory
> >>
> >> in either case, you are copying to memory, and therefore cycling thru
> >> the cpu cache (of course).  The difference is whether the Java GC has
> >> to deal with the aftermath or not.
> >>
> >> So the question "DBB or not" is not one about CPU caches, but one
> >> about garbage collection.  Of course, nothing is free: dealing
> >> with DBB requires extensive in-situ bounds checking (look at the
> >> source code for that class!), and also requires manual memory
> >> management on the part of the programmer.  So you are faced with an
> >> expensive API (getByte is not as good as an array get) and a lot more
> >> homework to do.  I have decided it's not worth it personally and am
> >> not chasing that line as a potential performance improvement, and I
> >> would encourage you not to either.
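The heap-vs-DirectBB choice Ryan describes can be sketched with plain NIO
(block size and contents here are invented for illustration; this is just
the copy step, not real HFile code).  Both copies move the same bytes
through the CPU cache; only the second is invisible to the GC:

```java
import java.nio.ByteBuffer;

public class BlockCopyDemo {
    static final int BLOCK_SIZE = 256 * 1024; // e.g. one 256KB block

    public static void main(String[] args) {
        // Stand-in for a block just read from the DFS.
        byte[] source = new byte[BLOCK_SIZE];
        for (int i = 0; i < source.length; i++) source[i] = (byte) i;

        // Choice 1: copy onto the Java heap; the GC later deals with it.
        ByteBuffer heap = ByteBuffer.allocate(BLOCK_SIZE);
        heap.put(source);
        heap.flip();

        // Choice 2: copy into off-heap memory; no GC involvement, but
        // every get() pays the bounds-checked DirectByteBuffer API cost.
        ByteBuffer direct = ByteBuffer.allocateDirect(BLOCK_SIZE);
        direct.put(source);
        direct.flip();

        // Reading 200 bytes out of the middle works the same either way;
        // per-byte get() on the direct buffer is slower than array access.
        int offset = BLOCK_SIZE / 2;
        byte[] small = new byte[200];
        for (int i = 0; i < small.length; i++) {
            small[i] = direct.get(offset + i);
        }

        System.out.println(heap.get(offset) == direct.get(offset)); // true
    }
}
```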