Re: KeyValue size in bytes compared to store files size
lars hofhansl 2014-01-19, 00:00
You'll have to try. Currently in HBase the unencoded KeyValues have to be rematerialized from the encoded blocks during scanning, which slows scanning down. At the same time, the blocks are stored in the block cache in the encoded format, so more data fits into the block cache. So it depends on your specific use case.
(Note that encoded blocks are stored in the block cache as is, but compressed blocks need to be decompressed before they are cached.)
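
To make the trade-off above concrete, here is a toy prefix encoder in plain Java. It is only a sketch of the general idea behind block encoding and is not HBase's FAST_DIFF implementation; the class name, the key strings, and the 4-byte cost assumed for each prefix length are all made up for illustration. The encoded form is smaller (more keys fit in a cache of a given size), but every key has to be rebuilt from its predecessor before it can be used, which is the extra work paid during a scan.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PrefixEncodingSketch {
    /** One encoded entry: number of bytes shared with the previous key, plus the rest. */
    static final class Entry {
        final int sharedPrefixLen;
        final byte[] suffix;

        Entry(int sharedPrefixLen, byte[] suffix) {
            this.sharedPrefixLen = sharedPrefixLen;
            this.suffix = suffix;
        }
    }

    /** Encodes sorted keys as (shared prefix length, suffix) pairs. */
    static List<Entry> encode(List<byte[]> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (byte[] key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length, key.length);
            while (shared < max && prev[shared] == key[shared]) {
                shared++;
            }
            out.add(new Entry(shared, Arrays.copyOfRange(key, shared, key.length)));
            prev = key;
        }
        return out;
    }

    /** Rebuilds the full keys; each key needs the previously decoded key. */
    static List<byte[]> decode(List<Entry> entries) {
        List<byte[]> out = new ArrayList<>();
        byte[] prev = new byte[0];
        for (Entry e : entries) {
            byte[] key = new byte[e.sharedPrefixLen + e.suffix.length];
            System.arraycopy(prev, 0, key, 0, e.sharedPrefixLen);
            System.arraycopy(e.suffix, 0, key, e.sharedPrefixLen, e.suffix.length);
            out.add(key);
            prev = key;
        }
        return out;
    }

    /** Encoded size, assuming (hypothetically) 4 bytes to store each prefix length. */
    static int encodedSize(List<Entry> entries) {
        int total = 0;
        for (Entry e : entries) {
            total += 4 + e.suffix.length;
        }
        return total;
    }

    static int rawSize(List<byte[]> keys) {
        int total = 0;
        for (byte[] k : keys) {
            total += k.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Hypothetical keys: the same row key repeated for several qualifiers.
        List<byte[]> keys = new ArrayList<>();
        for (String s : new String[] {
                "row-0001/cf:q1", "row-0001/cf:q2", "row-0001/cf:q3", "row-0002/cf:q1"}) {
            keys.add(s.getBytes(StandardCharsets.UTF_8));
        }
        List<Entry> encoded = encode(keys);
        System.out.println("raw bytes:     " + rawSize(keys));
        System.out.println("encoded bytes: " + encodedSize(encoded));
        System.out.println("decoded keys:  " + decode(encoded).size());
    }
}
```

With longer row keys and more qualifiers per row the saving grows, while the decode loop stays a per-key cost on every scan.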
From: Amit Sela <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; lars hofhansl <[EMAIL PROTECTED]>
Sent: Saturday, January 18, 2014 8:28 AM
Subject: Re: KeyValue size in bytes compared to store files size
@Stack: I counted both compressed and uncompressed tables and it's the same, so this really is a case where 100GB can be compressed to 7GB :)
@Lars: I took a look at https://issues.apache.org/jira/browse/HBASE-4218 and it mentions that this could make writing and scanning slower. Since I write only with bulk load I'm not worried about writes, but how much slower will scanning be?
On Fri, Jan 17, 2014 at 8:20 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:
Somewhat unrelated, but you might benefit from block encoding in addition to compression in your case.
>Try to set DATA_BLOCK_ENCODING to FAST_DIFF in your column families.
>----- Original Message -----
>From: Amit Sela <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Thursday, January 16, 2014 1:00 AM
>Subject: Re: KeyValue size in bytes compared to store files size
>I tried the bulk load and KeyValue size counts with an uncompressed table and it
>makes sense now. The count is equal to the store file size.
>I took a look at the (uncompressed) files and they seem to be OK.
>The entire bulk load is ~100GB, which ends up as 7GB when using GZ.
>Could such a compression ratio make sense in case of many qualifiers per
>row in a table (avg is 16 but in practice there are some rows with much
>more and even a small number of rows with hundreds of thousands...) ? If
>each KeyValue contains the rowkey, and the rowkeys contain more bytes than
>the qualifiers / values, then the rowkeys repeat themselves in the HFile and
>actually make up most of the HFile, right ?
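
The reasoning above (long row keys repeated for every qualifier compress very well) can be checked on synthetic data with nothing but the JDK. This is a rough sketch, not a model of an HFile: the row key format, the cell layout, and the counts are hypothetical, and real ratios depend on the actual keys and values.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class RowKeyCompressionSketch {
    /** Builds a byte stream in which every cell repeats its (long) row key. */
    static byte[] buildCells(int rows, int qualifiersPerRow) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int r = 0; r < rows; r++) {
            // Hypothetical row key, deliberately longer than qualifier + value.
            String rowKey = String.format("customer-%012d-session-%012d", r, r * 31L);
            for (int q = 0; q < qualifiersPerRow; q++) {
                String cell = rowKey + "|f|q" + q + "|" + (q % 10);
                byte[] bytes = cell.getBytes(StandardCharsets.UTF_8);
                out.write(bytes, 0, bytes.length);
            }
        }
        return out.toByteArray();
    }

    /** Compresses the input with the JDK's GZIP implementation. */
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] raw = buildCells(1000, 16); // 16 qualifiers per row, as in the average above
        byte[] compressed = gzip(raw);
        System.out.println("raw bytes:        " + raw.length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.printf("ratio:            %.1f : 1%n", (double) raw.length / compressed.length);
    }
}
```

Raising `qualifiersPerRow` or lengthening the row key should push the ratio up, which is the effect described for rows with many qualifiers.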
>On Wed, Jan 15, 2014 at 9:52 PM, Stack <[EMAIL PROTECTED]> wrote:
>> There can be a lot of duplication in what ends up in HFiles but 500MB ->
>> 32MB does seem too good to be true.
>> Could you try writing without GZIP or mess with the hfile reader to see
>> what your keys look like when at rest in an HFile (and maybe save the
>> decompressed hfile to compare sizes?)
>> 1. http://hbase.apache.org/book.html#hfile
>> On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
>> > I'm talking about the store files size and the ratio between store file
>> > size and the byte count as counted in PutSortReducer.
>> > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>> > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
>> > >
>> > >
>> > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
>> > >
>> > > > Hi all,
>> > > > I'm trying to measure the size (in bytes) of the data I'm about to load
>> > > > into HBase.
>> > > > I'm using bulk load with PutSortReducer.
>> > > > All bulk load data is loaded into new regions and not added to existing
>> > > > ones.
>> > > >
>> > > > In order to count the size of all KeyValues in the Put object I iterate
>> > > > over the Put's familyMap.values() and sum the KeyValue lengths.
>> > > > After loading the data, I check the region size by summing the
>> > > > RegionLoad.getStorefileSizeMB().
>> > > > Counting the Put objects size predicted ~500MB per region but in practice I
>> > > > got ~32MB per region.
>> > > > The table uses GZ compression but this cannot be the cause of such a
>> > > > difference.
>> > > >
>> > > > Is counting the Put's KeyValues the correct way to count a row size ?
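
For reference, the uncompressed length of a single KeyValue can be estimated by hand. The sketch below assumes the classic KeyValue layout without tags (4-byte key length, 4-byte value length, 2-byte row length, 1-byte family length, 8-byte timestamp, 1-byte type, plus the row, family, qualifier and value bytes). The sizes in `main` are hypothetical, and store files add their own overhead (indexes, metadata) that is not modeled here.

```java
public class KeyValueSizeSketch {
    // Fixed per-cell overhead in the assumed layout:
    // 4 (key length) + 4 (value length) + 2 (row length)
    // + 1 (family length) + 8 (timestamp) + 1 (type) = 20 bytes.
    static final int FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1;

    /** Estimated serialized length of one KeyValue, in bytes. */
    static long keyValueLength(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (long) FIXED_OVERHEAD + rowLen + familyLen + qualifierLen + valueLen;
    }

    /** Fraction of one KeyValue's bytes taken up by the row key. */
    static double rowKeyShare(int rowLen, int familyLen, int qualifierLen, int valueLen) {
        return (double) rowLen / keyValueLength(rowLen, familyLen, qualifierLen, valueLen);
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a long row key, short family/qualifier, small value.
        int rowLen = 40, familyLen = 1, qualifierLen = 4, valueLen = 8;
        int cellsPerRow = 16;

        long perCell = keyValueLength(rowLen, familyLen, qualifierLen, valueLen);
        System.out.println("bytes per KeyValue: " + perCell);
        System.out.println("bytes per row:      " + perCell * cellsPerRow);
        System.out.printf("row key share:      %.0f%%%n",
                100 * rowKeyShare(rowLen, familyLen, qualifierLen, valueLen));
    }
}
```

Summing this estimate over all cells gives the uncompressed size, which matches the later finding in this thread that the counted size equals the store file size on an uncompressed table.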