There can be a lot of duplication in what ends up in HFiles, but 500MB ->
32MB does seem too good to be true.
Could you try writing without GZIP, or mess with the HFile reader to see
what your keys look like at rest in an HFile (and maybe save the
decompressed HFile to compare sizes)?
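For example (the table/family names and the path below are just
placeholders, and the exact flags vary a bit by version), you could create
a copy of the table with compression disabled, bulk load into it, and then
dump one of the resulting HFiles:

  create 'mytable_nogz', {NAME => 'f', COMPRESSION => 'NONE'}

  hbase org.apache.hadoop.hbase.io.hfile.HFile -p -m -f \
      hdfs:///hbase/mytable_nogz/<region>/f/<hfile>

The -p/-m output prints every KeyValue plus the file metadata, so you can
see how big the keys really are on disk.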
On Wed, Jan 15, 2014 at 7:43 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> I'm talking about the store files size and the ratio between store file
> size and the byte count as counted in PutSortReducer.
> On Wed, Jan 15, 2014 at 5:35 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
> > See previous discussion: http://search-hadoop.com/m/85S3A1DgZHP1
> > On Wed, Jan 15, 2014 at 5:44 AM, Amit Sela <[EMAIL PROTECTED]> wrote:
> > > Hi all,
> > > I'm trying to measure the size (in bytes) of the data I'm about to load
> > > into HBase.
> > > I'm using bulk load with PutSortReducer.
> > > All bulk load data is loaded into new regions and not added to existing
> > > ones.
> > >
> > > In order to count the size of all KeyValues in the Put object, I iterate
> > > over the Put's familyMap.values() and sum the KeyValue lengths.
> > > After loading the data, I check the region size by summing the
> > > RegionLoad.getStorefileSizeMB().
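> > > In code, the counting side is roughly this (a simplified sketch; the kv
> > > length covers the key, the value and the KeyValue serialization overhead):
> > >
> > >   long putBytes = 0;
> > >   for (List<KeyValue> kvs : put.getFamilyMap().values()) {
> > >     for (KeyValue kv : kvs) {
> > >       putBytes += kv.getLength(); // full serialized KeyValue size
> > >     }
> > >   }
> > >
> > > and the on-disk side is just getStorefileSizeMB() summed over the new
> > > regions.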
> > > Counting the Put objects' size predicted ~500MB per region, but in
> > > practice I got ~32MB per region.
> > > The table uses GZ compression, but that alone can't account for such a
> > > difference.
> > >
> > > Is counting the Put's KeyValues the correct way to count a row size? Is
> > > it comparable to the store files size?
> > >
> > > Thanks,
> > > Amit.
> > >