Re: KeyValue size in bytes compared to store files size
I tried the bulk load and the KeyValue size count with an uncompressed
table, and it makes sense now: the count is equal to the store file size.
I took a look at the (uncompressed) files and they seem to be OK.

The entire bulk load is ~100GB; with GZ it ends up being 7GB.

Could such a compression ratio make sense in the case of many qualifiers per
row in a table (the average is 16, but in practice some rows have much
more, and even a small number of rows have hundreds of thousands...)? If
each KeyValue contains the rowkey, and the rowkeys contain more bytes than
the qualifiers / values, then the rowkeys repeat themselves in the HFile and
actually make up most of the HFile, right?
On Wed, Jan 15, 2014 at 9:52 PM, Stack <[EMAIL PROTECTED]> wrote:

> There can be a lot of duplication in what ends up in HFiles but 500MB ->
> 32MB does seem too good to be true.
>
> Could you try writing without GZIP or mess with the hfile reader[1] to see
> what your keys look like when at rest in an HFile (and maybe save the
> decompressed hfile to compare sizes?)
>
> St.Ack
> 1. http://hbase.apache.org/book.html#hfile
>
>
> On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
>
> > I'm talking about the store files size and the ratio between store file
> > size and the byte count as counted in PutSortReducer.
> >
> >
> > On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> >
> > > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > >
> > >
> > > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi all,
> > > > I'm trying to measure the size (in bytes) of the data I'm about to load
> > > > into HBase.
> > > > I'm using bulk load with PutSortReducer.
> > > > All bulk load data is loaded into new regions and not added to existing
> > > > ones.
> > > >
> > > > In order to count the size of all KeyValues in the Put object, I iterate
> > > > over the Put's familyMap.values() and sum the KeyValue lengths.
> > > > After loading the data, I check the region size by summing the
> > > > RegionLoad.getStorefileSizeMB().
> > > > Counting the Put objects' size predicted ~500MB per region, but in
> > > > practice I got ~32MB per region.
> > > > The table uses GZ compression, but this cannot be the cause of such a
> > > > difference.
> > > >
> > > > Is counting the Put's KeyValues the correct way to count a row size? Is
> > > > it comparable to the store file size?
> > > >
> > > > Thanks,
> > > > Amit.
> > > >
> > >
> >
>
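
For reference, a minimal sketch of the counting approach described above,
against the 0.94-era client API where Put.getFamilyMap() returns
Map<byte[], List<KeyValue>> (newer versions use getFamilyCellMap() and
Cell instead):

import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;

public class PutSizeCounter {

    // Sums the serialized length of every KeyValue in a Put -- the same
    // idea as iterating the familyMap values inside PutSortReducer.
    public static long sizeInBytes(Put put) {
        long total = 0;
        for (List<KeyValue> kvs : put.getFamilyMap().values()) {
            for (KeyValue kv : kvs) {
                // getLength() covers the whole cell: the key/value length
                // prefixes, rowkey, family, qualifier, timestamp, type and
                // value -- so the rowkey is counted once per cell.
                total += kv.getLength();
            }
        }
        return total;
    }
}

To check what actually lands on disk, the HFile tool from Stack's link can
print the keys at rest, e.g.:

hbase org.apache.hadoop.hbase.io.hfile.HFile -p -f <path-to-hfile>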