HBase, mail # dev - prefix compression


Re: prefix compression
Matt Corgan 2011-06-04, 04:57
Ah - I see.  It's generating multiple duplicate timestamps per millisecond,
so there are fewer than 50mm unique strings.  Duplicates just require
incrementing a counter.  Agree it's very cool though!
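
To make the point above concrete, here is a minimal sketch. It is not the pastebin code; the class and method names (`TimestampKeys`, `uniqueKeys`, `rawBytes`) are hypothetical, and it assumes keys are formatted in UTC with `SimpleDateFormat`. It shows that when several records share a millisecond, the number of distinct 17-character keys drops accordingly, and it reproduces the 17 bytes * 50mm raw-size arithmetic quoted below.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashSet;
import java.util.Set;
import java.util.TimeZone;

public class TimestampKeys {
    // Builds a formatter for 17-character yyyyMMddHHmmssSSS keys, pinned to UTC
    // so the output does not depend on the machine's time zone.
    static SimpleDateFormat newFormat() {
        SimpleDateFormat fmt = new SimpleDateFormat("yyyyMMddHHmmssSSS");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        return fmt;
    }

    // Counts the distinct keys produced when `records` keys are generated and
    // `perMilli` consecutive records share the same millisecond (an assumption
    // made for illustration; the real generator may behave differently).
    static int uniqueKeys(int records, int perMilli) {
        SimpleDateFormat fmt = newFormat();
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < records; i++) {
            long epochMillis = i / perMilli;
            seen.add(fmt.format(new Date(epochMillis)));
        }
        return seen.size();
    }

    // Raw size of the input if every record carries a full key.
    static long rawBytes(long records, int bytesPerKey) {
        return records * bytesPerKey;
    }

    public static void main(String[] args) {
        System.out.println("unique keys, 1 per ms: " + uniqueKeys(1000, 1));
        System.out.println("unique keys, 4 per ms: " + uniqueKeys(1000, 4));
        System.out.println("raw bytes for 50mm keys: " + rawBytes(50_000_000L, 17));
    }
}
```

With four records per millisecond, only a quarter of the inputs are new strings; the rest would only bump the output counter.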

sent from my phone
On Jun 3, 2011 9:02 PM, "Jason Rutherglen" <[EMAIL PROTECTED]>
wrote:
> Yeah it's truly super wild! Here's the code: http://pastebin.com/bnB53UQz
>
> You can see the line that's adding the string:
>
> fstBuilder.add(new BytesRef(date), new Long(x));
>
> On Fri, Jun 3, 2011 at 8:56 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
>> Jason - are you feeding it that whole string for each date?  Input data is
>> 17 bytes per record * 50mm records = 850MB, and that reduces to 984 bytes?
>>  Is it possible to compress by that much?  Maybe I'm missing something about
>> how the FST works.
>>
>> Matt
>>
>>
>> On Fri, Jun 3, 2011 at 8:51 PM, Jason Rutherglen <[EMAIL PROTECTED]> wrote:
>>
>>> Also the next thing to measure with the FST is the key lookup speed.
>>> I'm not sure what that'd look like, or how to compare with HBase right
>>> now?
>>>
>>> On Fri, Jun 3, 2011 at 8:42 PM, Jason Rutherglen
>>> <[EMAIL PROTECTED]> wrote:
>>> > Here's a nice preliminary number with the FST, 50 million dates of the
>>> > form yyyyMMddHHmmssSSS, with each incremented by one millisecond.  The
>>> > FST is 984 bytes, with an incrementing long to point to the presumably
>>> > MMap'd value data.  This is a bit crazy.
>>> >
>>> > Perhaps we should try other increments as well?  Given that HBase keys
>>> > especially are probably close increments of each other, I think the
>>> > FST can always be loaded into RAM with pointers out to the actual
>>> > values.
>>> >
>>>
>>
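
As a rough illustration of why closely incremented keys shrink so much, here is a minimal sketch of plain front coding (storing only the part of each sorted key that differs from its predecessor). This is a simpler technique than Lucene's FST, which also shares suffixes, and it is not what the pastebin code does. The class `FrontCoding`, its methods, and the fixed `20110603204200` prefix are all hypothetical and chosen only for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class FrontCoding {
    // Length of the common prefix of two strings.
    static int sharedPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Characters that remain to be stored when each key in a sorted list keeps
    // only the suffix that differs from the previous key.
    static long suffixChars(List<String> sortedKeys) {
        long total = 0;
        String prev = "";
        for (String key : sortedKeys) {
            total += key.length() - sharedPrefix(prev, key);
            prev = key;
        }
        return total;
    }

    // Generates `count` (at most 1000) consecutive yyyyMMddHHmmssSSS keys that
    // differ only in the millisecond field. The date prefix is made up.
    static List<String> millisKeys(int count) {
        List<String> keys = new ArrayList<>();
        for (int ms = 0; ms < count; ms++) {
            keys.add(String.format("20110603204200%03d", ms));
        }
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = millisKeys(1000);
        long raw = 17L * keys.size();
        System.out.println("raw chars:    " + raw);
        System.out.println("suffix chars: " + suffixChars(keys));
    }
}
```

Most consecutive keys differ only in the last digit, so the stored suffixes are a small fraction of the raw input even before any suffix sharing.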