HBase >> mail # user >> KeyValue size in bytes compared to store files size


Re: KeyValue size in bytes compared to store files size
@Stack: I counted both the compressed and uncompressed tables and the count
is the same, so this really is a case where 100GB can be compressed to 7 :)
@Lars: I took a look at https://issues.apache.org/jira/browse/HBASE-4218 and
it mentions that the encoding could make writing and scanning slower. Since I
write only with bulk load I'm not worried about writes, but how much slower
will scanning be?
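[The rowkey-repetition effect discussed later in this thread can be sanity-checked with a little arithmetic. A minimal sketch, assuming the classic pre-0.96 KeyValue on-disk layout (4-byte key length + 4-byte value length, then a key made of 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type); the 32-byte rowkey, 1-byte family, 4-byte qualifier, and 8-byte value are hypothetical numbers, not the poster's schema:]

```java
public class KeyValueSize {
    // Serialized size of one classic HBase KeyValue, given the field lengths.
    // Layout assumed: keyLen(4) + valueLen(4) + [rowLen(2) + row +
    // famLen(1) + family + qualifier + timestamp(8) + type(1)] + value.
    static long serializedSize(int rowLen, int famLen, int qualLen, int valLen) {
        long keyLen = 2 + rowLen + 1 + famLen + qualLen + 8 + 1;
        return 4 + 4 + keyLen + valLen;
    }

    public static void main(String[] args) {
        // Hypothetical row: 32-byte rowkey, 1-byte family, 4-byte qualifier,
        // 8-byte value, 16 qualifiers per row (the thread's stated average).
        long perKv = serializedSize(32, 1, 4, 8);
        long perRow = 16 * perKv;
        long rowkeyBytes = 16 * 32; // the rowkey is stored once per KeyValue
        System.out.println("bytes per KeyValue: " + perKv);
        System.out.println("bytes per row (16 qualifiers): " + perRow);
        System.out.printf("rowkey share of row bytes: %.0f%%%n",
                100.0 * rowkeyBytes / perRow);
    }
}
```

[With these made-up numbers the repeated rowkeys alone are roughly half of every row's bytes, which is exactly the redundancy that FAST_DIFF encoding and GZ compression squeeze out.]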
On Fri, Jan 17, 2014 at 8:20 PM, lars hofhansl <[EMAIL PROTECTED]> wrote:

> Somewhat unrelated, but you might benefit from block encoding in addition
> to compression in your case.
> Try to set DATA_BLOCK_ENCODING to FAST_DIFF in your column families.
>
> -- Lars
>
>
>
> ----- Original Message -----
> From: Amit Sela <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Thursday, January 16, 2014 1:00 AM
> Subject: Re: KeyValue size in bytes compared to store files size
>
> I tried the bulk load and KeyValue size counts with an uncompressed table,
> and it makes sense now: the count is equal to the store file size.
> I took a look at the (uncompressed) files and they seem to be OK.
>
> The entire bulk load is ~100GB; when using GZ it ends up being 7GB.
>
> Could such a compression ratio make sense in the case of many qualifiers per
> row in a table (the average is 16, but in practice some rows have many more,
> and a small number of rows even have hundreds of thousands...)? If each
> KeyValue contains the rowkey, and the rowkeys contain more bytes than the
> qualifiers / values, then the rowkeys repeat themselves in the HFile and
> actually make up most of it, right?
>
>
>
>
>
>
>
> On Wed, Jan 15, 2014 at 9:52 PM, Stack <[EMAIL PROTECTED]> wrote:
>
> > There can be a lot of duplication in what ends up in HFiles, but 500MB ->
> > 32MB does seem too good to be true.
> >
> > Could you try writing without GZIP, or mess with the hfile reader[1] to
> > see what your keys look like at rest in an HFile (and maybe save the
> > decompressed hfile to compare sizes)?
> >
> > St.Ack
> > 1. http://hbase.apache.org/book.html#hfile
> >
> >
> > On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> >
> > > I'm talking about the store files size and the ratio between store file
> > > size and the byte count as counted in PutSortReducer.
> > >
> > >
> > > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > >
> > > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > > >
> > > >
> > > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[EMAIL PROTECTED]>
> > wrote:
> > > >
> > > > > Hi all,
> > > > > I'm trying to measure the size (in bytes) of the data I'm about to
> > > > > load into HBase.
> > > > > I'm using bulk load with PutSortReducer.
> > > > > All bulk load data is loaded into new regions and not added to
> > > > > existing ones.
> > > > >
> > > > > In order to count the size of all KeyValues in the Put object, I
> > > > > iterate over the Put's familyMap.values() and sum the KeyValue
> > > > > lengths.
> > > > > After loading the data, I check the region size by summing
> > > > > RegionLoad.getStorefileSizeMB().
> > > > > Counting the Put objects' sizes predicted ~500MB per region, but in
> > > > > practice I got ~32MB per region.
> > > > > The table uses GZ compression, but this cannot be the cause of such
> > > > > a difference.
> > > > >
> > > > > Is counting the Put's KeyValues the correct way to count a row
> > > > > size? Is it comparable to the store file size?
> > > > >
> > > > > Thanks,
> > > > > Amit.
> > > > >
> > > >
> > >
> >
>
>
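[The ~14:1 ratio reported in this thread (100GB raw -> 7GB with GZ, and ~500MB -> ~32MB per region) is plausible when a long rowkey is repeated once per KeyValue. A self-contained sketch with purely synthetic data (the `user-` rowkey format, 200 cells per row, and 1000 rows are invented for illustration, not the poster's table):]

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.GZIPOutputStream;

public class RepeatRatio {
    // Build a synthetic, HFile-like byte stream: a 32-byte rowkey repeated
    // once per cell, 200 cells per row, each cell with a small varying
    // suffix. Return raw size / gzipped size.
    static double compressionRatio() throws Exception {
        StringBuilder sb = new StringBuilder();
        for (int row = 0; row < 1000; row++) {
            String rowkey = String.format("user-%027d", row); // 32 bytes
            for (int q = 0; q < 200; q++) {
                sb.append(rowkey).append(':').append(q).append('\n');
            }
        }
        byte[] raw = sb.toString().getBytes("UTF-8");

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return (double) raw.length / bos.size();
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("gzip ratio on repeated rowkeys: %.1f:1%n",
                compressionRatio());
    }
}
```

[On data this repetitive, gzip's ratio easily exceeds the ~14:1 the poster saw; the point is only that rowkey repetition, not a counting bug, can account for the gap between the summed KeyValue bytes and the compressed store file size.]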