Re: HBase as a large, auto-partitioned, sorted, *in-memory* database (was: Re: prefix compression implementation)
Replies inline below:

On Mon, Sep 19, 2011 at 10:08 PM, Stack <[EMAIL PROTECTED]> wrote:

> Excellent summary Matt.  Some notes in the below.
>
> On Sun, Sep 18, 2011 at 6:43 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> > ... All of this is relatively easy for the data
> > and index blocks because they're immutable.  Doing it for the memstore is
> > another story...
> >
>
> We'd need another data structure completely, wouldn't we?  Have you
> given it any thought?
>

I've given it some thought, but I think it can wait till after the block
format stuff is in place.  A lot of the utility methods from that can be
reused for the memstore implementation, and there should be some
interfaces in place by then too.  The memstore changes are easier than
the block format changes in a few regards: the memstore isn't persisted
anywhere and doesn't have to be backwards compatible.

The closest published thing I've found to what I envision is this paper
on copying burstsort:
<http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.3498&rep=rep1&type=pdf>
It builds the memstore into a trie structure, but one where the leaves
are big byte[]s that hold many keys until they fill up; then you split
(burst) the leaves.  It's CPU, memory, and garbage-collector friendly.
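
Roughly, the structure I have in mind is something like the sketch
below (hypothetical names and sizes, nothing from an actual HBase API):
each leaf copies key suffixes into one shared byte[] until it fills up,
then bursts into children keyed by the next byte.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of a burst trie (hypothetical names/sizes, not an HBase API).
    class BurstTrieNode {
        private static final int BUCKET_SIZE = 4096;
        private BurstTrieNode[] children;               // non-null once burst
        private byte[] bucket = new byte[BUCKET_SIZE];  // shared leaf storage
        private List<int[]> slots = new ArrayList<>();  // {offset, length} per suffix
        private int used = 0;
        private int endsHere = 0;  // count of keys that end exactly at this node

        void add(byte[] key, int from, int len) {
            if (children == null) {                     // still a leaf
                if (used + len > bucket.length) {       // bucket full: burst, retry
                    burst();
                    add(key, from, len);
                    return;
                }
                System.arraycopy(key, from, bucket, used, len);
                slots.add(new int[] { used, len });     // remember where the suffix lives
                used += len;
                return;
            }
            if (len == 0) { endsHere++; return; }       // key fully consumed on the path
            int b = key[from] & 0xFF;                   // consume one byte and descend
            if (children[b] == null) children[b] = new BurstTrieNode();
            children[b].add(key, from + 1, len - 1);
        }

        private void burst() {                          // turn this leaf into an internal
            byte[] old = bucket;                        // node and redistribute suffixes
            List<int[]> oldSlots = slots;
            children = new BurstTrieNode[256];
            bucket = null;
            slots = new ArrayList<>();
            used = 0;
            for (int[] s : oldSlots) {
                add(old, s[0], s[1]);                   // re-add via the internal path
            }
        }
    }

A real memstore would also need ordered iteration, per-cell metadata,
and concurrency control; this only shows the bucket-and-burst mechanics.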

>
>
> > Here's some contrived data to illustrate.  Row key is a (reversed
> > domain) url.  Column family is called "familyName".  ColQualifier is a
> > browser type.  Value is the number of views from that browser type.
> > Here's the current representation <http://pastebin.com/7ks8kzJ2>, the
> > prefix trie's toString output <http://pastebin.com/cL4AkCPC> for the
> > row keys, and removing the whitespace <http://pastebin.com/4qiSXNh9>
> > from the toString output.
>
> That's a big diff.
>
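
To make that diff concrete, here's a toy example (made-up keys, not the
pastebin data) that measures how much of each sorted row key repeats
its predecessor, which is exactly the part a prefix encoding wouldn't
re-store:

    // Toy illustration only (made-up keys, not the pastebin data).
    public class PrefixDemo {
        static int commonPrefix(byte[] a, byte[] b) {
            int n = Math.min(a.length, b.length);
            int i = 0;
            while (i < n && a[i] == b[i]) i++;
            return i;
        }

        public static void main(String[] args) {
            String[] rows = {                 // sorted reversed-domain row keys
                "com.example.www/about",
                "com.example.www/index",
                "com.example.www/products/1",
                "com.example.www/products/2",
            };
            byte[] prev = new byte[0];
            for (String r : rows) {
                byte[] cur = r.getBytes();
                // Bytes repeated from the previous key are exactly what a
                // prefix encoding would not re-store.
                int shared = commonPrefix(prev, cur);
                System.out.println("shared=" + shared + " stored=\""
                        + r.substring(shared) + "\"");
                prev = cur;
            }
        }
    }
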
> > The second problem is the lack of random access to cell offsets
> > within the data block.  (I'm not 100% sure on this one, so please
> > correct me if I'm wrong.)  I noticed how bad this problem is when I
> > was storing historical event logs with 8 byte keys and small values
> > (so ~30b per cell).  I had to increase block size to 256KB because
> > the block indexes were too big to fit in memory.  Then I needed fast
> > random access to these events.  The problem is that there are
> > ~10,000 cells per block, so without random lookups inside the block,
> > it's seeking through ~5,000 keys for the average lookup.  That's a
> > crazy amount of overhead to retrieve a single cell.  Probably the
> > quickest solution to this problem is to store the offset of each key
> > in a list of integers at the end of the block so that it's possible
> > to do a binary search inside the block.  That would reduce it to ~14
> > avg memory accesses to find the cell.
> >
>
> I'd imagine that we'd not want the index always, just when keys per
> block went over some size.
>
> hfilev2 should help here because we don't load all indices all the time.
>

I dug into HFileV2 yesterday.  It's really great, and it solves the
biggest problem I currently face: all of the indexes being held in
memory.  It also addresses my point about the lack of random access
inside blocks by adding a "secondary index", but only for the index
blocks as far as I could tell.  Adding a secondary index to the data
blocks would be great, but it's not that interesting to me because I'd
rather go a step further and fix the prefix compression problem at the
same time.
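
For reference, the offset-list idea from the quote above in sketch
form.  The block layout here is made up for illustration (a 2-byte
length-prefixed key per cell, offsets plus a count at the tail), not
the actual HFile format:

    import java.nio.ByteBuffer;

    // Sketch of binary search over an in-block offset list.  Assumed layout:
    //   [cell 0]...[cell n-1][off 0]...[off n-1][n]
    // Each cell starts with a 2-byte key length followed by the key bytes;
    // the tail holds one 4-byte offset per cell, then the cell count.
    class BlockSearchSketch {

        static int findCell(ByteBuffer block, byte[] target) {
            int limit = block.limit();
            int n = block.getInt(limit - 4);          // cell count from the tail
            int offsets = limit - 4 - 4 * n;          // start of the offset list
            int lo = 0, hi = n - 1;
            while (lo <= hi) {                        // ~log2(n) probes
                int mid = (lo + hi) >>> 1;
                int cellOff = block.getInt(offsets + 4 * mid);
                int cmp = compareKeyAt(block, cellOff, target);
                if (cmp == 0) return cellOff;         // offset of the matching cell
                if (cmp < 0) lo = mid + 1;
                else hi = mid - 1;
            }
            return -1;                                // not found
        }

        // Lexicographic compare (unsigned bytes) of the stored key vs target.
        private static int compareKeyAt(ByteBuffer block, int cellOff, byte[] target) {
            int keyLen = block.getShort(cellOff) & 0xFFFF;
            int common = Math.min(keyLen, target.length);
            for (int i = 0; i < common; i++) {
                int a = block.get(cellOff + 2 + i) & 0xFF;
                int b = target[i] & 0xFF;
                if (a != b) return a - b;
            }
            return keyLen - target.length;
        }
    }

With ~10,000 cells per block, that's the ~14 probes mentioned above
instead of scanning ~5,000 keys.
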
> > But, the prefix trie should actually be way faster than a binary
> > search because you don't have to keep comparing the beginning of the
> > key to the one you're looking for.  I'll save the details for
> > another email or for code comments.  In general, with a smarter
> > block/index format we could be much more efficient than the current
> > method of comparing full key byte[] over and over.
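
A sketch of why (node layout hypothetical): the trie consumes each byte
of the probe key at most once on the way down, while every probe of a
binary search may re-compare the key from its first byte.

    // Sketch only; node layout is hypothetical.  Each byte of the probe
    // key is examined at most once, so lookup cost is O(key length)
    // regardless of how many keys are stored, and shared prefixes are
    // never re-compared.
    class TrieLookupSketch {
        static final class Node {
            Node[] children = new Node[256];
            boolean terminal;                 // a stored key ends here
        }

        static boolean contains(Node root, byte[] key) {
            Node cur = root;
            for (byte b : key) {
                cur = cur.children[b & 0xFF]; // one step per key byte
                if (cur == null) return false;
            }
            return cur.terminal;
        }
    }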

Thanks.  I'll create some issues after I learn a little more from you
guys.