HBase >> mail # user >> KeyValue size in bytes compared to store files size


Re: KeyValue size in bytes compared to store files size
@Stack: I counted both the compressed and uncompressed tables and the counts
are the same, so this really is a case where 100GB compresses to 7GB :)
@Lars: I took a look at https://issues.apache.org/jira/browse/HBASE-4218, and
it mentions that this could make writing and scanning slower. Since I write
only with bulk load I'm not worried about writes, but how much slower will
scanning be?
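
As a rough illustration of why block encoding can help when consecutive keys share long rowkeys, here is a toy prefix encoder. It is only a sketch of the general idea and is NOT HBase's actual FAST_DIFF format; the key layout, the names `prefix_encode` / `prefix_decode`, and the one-byte cost charged for the shared-prefix length are all assumptions made for the example.

```python
def prefix_encode(sorted_keys):
    """Store each key as (shared_prefix_len, suffix) relative to the previous key."""
    encoded = []
    prev = b""
    for key in sorted_keys:
        shared = 0
        limit = min(len(prev), len(key))
        while shared < limit and prev[shared] == key[shared]:
            shared += 1
        encoded.append((shared, key[shared:]))
        prev = key
    return encoded


def prefix_decode(encoded):
    """Rebuild the original keys from (shared_prefix_len, suffix) pairs."""
    keys = []
    prev = b""
    for shared, suffix in encoded:
        key = prev[:shared] + suffix
        keys.append(key)
        prev = key
    return keys


# Hypothetical data: 3 rows with 32-byte rowkeys and 16 qualifiers each.
keys = sorted(
    b"row-%028d" % r + b"/cf:" + b"q%02d" % q
    for r in range(3)
    for q in range(16)
)

encoded = prefix_encode(keys)
raw_bytes = sum(len(k) for k in keys)
# Assumption: one byte is enough to store each shared-prefix length.
encoded_bytes = sum(1 + len(suffix) for _, suffix in encoded)

print("raw key bytes:    ", raw_bytes)
print("encoded key bytes:", encoded_bytes)
```

Because the keys are sorted, every cell after the first in a row repeats the whole rowkey, so only a short suffix has to be stored. The sketch says nothing about scan speed, which is the open question here.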
On Fri, Jan 17, 2014 at 8:20 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Somewhat unrelated, but you might benefit from block encoding in addition
> to compression in your case.
> Try to set DATA_BLOCK_ENCODING to FAST_DIFF in your column families.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Amit Sela <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Thursday, January 16, 2014 1:00 AM
> Subject: Re: KeyValue size in bytes compared to store files size
>
> I tried the bulk load and KV size counts with an uncompressed table and it
> makes sense now: the count is equal to the store file size.
> I took a look at the (uncompressed) files and they seem to be OK.
>
> The entire bulk load is ~100GB; when using GZ it ends up at 7GB.
>
> Could such a compression ratio make sense in the case of many qualifiers per
> row in a table (the avg is 16, but in practice there are some rows with many
> more and even a small number of rows with hundreds of thousands...)? If
> each KeyValue contains the rowkey, and the rowkeys contain more bytes than
> the qualifiers / values, then the rows repeat themselves in the HFile and
> actually make up most of the HFile, right?
>
> On Wed, Jan 15, 2014 at 9:52 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > There can be a lot of duplication in what ends up in HFiles but 500MB ->
> > 32MB does seem too good to be true.
> >
> > Could you try writing without GZIP or mess with the hfile reader[1] to see
> > what your keys look like when at rest in an HFile (and maybe save the
> > decompressed hfile to compare sizes?)
> >
> > St.Ack
> > 1. http://hbase.apache.org/book.html#hfile
> >
> >
> > On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> >
> > > I'm talking about the store files size and the ratio between store file
> > > size and the byte count as counted in PutSortReducer.
> > >
> > >
> > > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> > > >
> > > > > Hi all,
> > > > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > > > load into HBase.
> > > > > I'm using bulk load with PutSortReducer.
> > > > > All bulk load data is loaded into new regions and not added to
> > > > > existing ones.
> > > > >
> > > > > In order to count the size of all KeyValues in the Put object I
> > > > > iterate over the Put's familyMap.values() and sum the KeyValue
> > > > > lengths.
> > > > > After loading the data, I check the region size by summing the
> > > > > RegionLoad.getStorefileSizeMB().
> > > > > Counting the Put objects' size predicted ~500MB per region but in
> > > > > practice I got ~32MB per region.
> > > > > The table uses GZ compression but this cannot be the cause of such
> > > > > a difference.
> > > > >
> > > > > Is counting the Put's KeyValues the correct way to count a row's
> > > > > size? Is it comparable to the store files' size?
> > > > >
> > > > > Thanks,
> > > > > Amit.
> > > > >
> > > >
> > >
> >
>
>
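
For reference, the rowkey-repetition argument in this thread can be sketched numerically. The snippet below is plain Python and does not use the HBase API. It assumes the classic (pre-tags) KeyValue layout of 4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, then the value. The rowkey format, the family and qualifier names, the row count, and the helper names `kv_length` and `serialize_kv` are hypothetical, and Python's `zlib` stands in for the GZ codec, compressing one buffer rather than individual HFile blocks.

```python
import struct
import zlib

# Fixed per-KeyValue overhead under the assumed layout:
# key length (4) + value length (4) + row length (2) + family length (1)
# + timestamp (8) + type (1)
KV_FIXED_OVERHEAD = 4 + 4 + 2 + 1 + 8 + 1


def kv_length(row, family, qualifier, value):
    """Estimated serialized size of one KeyValue, in bytes."""
    return KV_FIXED_OVERHEAD + len(row) + len(family) + len(qualifier) + len(value)


def serialize_kv(row, family, qualifier, value, timestamp=0, kv_type=4):
    """Serialize one cell in the assumed layout (type 4 is assumed to mean Put)."""
    key = (
        struct.pack(">H", len(row)) + row
        + struct.pack(">B", len(family)) + family
        + qualifier
        + struct.pack(">QB", timestamp, kv_type)
    )
    return struct.pack(">II", len(key), len(value)) + key + value


# Hypothetical table: 1000 rows, 16 qualifiers per row, 40-byte rowkeys,
# 4-byte qualifiers, 8-byte values.
cells = []
rowkey_bytes = 0
for r in range(1000):
    row = b"user-%035d" % r
    for q in range(16):
        value = struct.pack(">Q", r * 16 + q)
        cells.append(serialize_kv(row, b"cf", b"q%03d" % q, value))
        rowkey_bytes += len(row)

raw = b"".join(cells)
compressed = zlib.compress(raw)

print("raw bytes:        ", len(raw))
print("rowkey share:     ", round(rowkey_bytes / len(raw), 2))
print("compressed bytes: ", len(compressed))
print("compression ratio:", round(len(raw) / len(compressed), 1))
```

With these made-up sizes the rowkey accounts for more than half of every serialized cell, so a general-purpose compressor finds a lot of redundancy. The ratio on real data depends on the actual keys and values and may differ widely from this toy case.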