Each KV is stored on disk like this:
KeyLength (4 bytes) + ValueLength (4 bytes) + rowkey length (2 bytes) + rowkey (... bytes) + CF length (1 byte) + CF (... bytes) + qualifier (... bytes) + timestamp (8 bytes) + type (1 byte) + value (... bytes)
If you are using HFile V2, a memstoreTS is also stored with every KV. It is 1 to 4 bytes long (mostly 1 byte, since the value is reset to 0 during compaction).
Now calculate whether the size you found matches the expected size.
If you are using version 0.94, there is a block encoding feature with which most of these extra bytes, other than the key and value, can be encoded to a smaller size.
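
The layout above can be turned into a quick per-KV size estimate. This is only a sketch: the function and constant names are made up for illustration, and the example call assumes that the 33 and 1 from the earlier estimate are the rowkey and CF name lengths, with an empty qualifier and an empty value.

```python
# Fixed-size fields of one KV, in bytes, as listed in the layout above.
KEY_LENGTH_FIELD = 4    # KeyLength
VALUE_LENGTH_FIELD = 4  # ValueLength
ROW_LENGTH_FIELD = 2    # rowkey length
CF_LENGTH_FIELD = 1     # CF length
TIMESTAMP_FIELD = 8     # timestamp (a long)
TYPE_FIELD = 1          # type


def key_size(row_len, cf_len, qualifier_len):
    """Bytes in the key part: everything from rowkey length through type."""
    return (ROW_LENGTH_FIELD + row_len + CF_LENGTH_FIELD + cf_len
            + qualifier_len + TIMESTAMP_FIELD + TYPE_FIELD)


def kv_size(row_len, cf_len, qualifier_len, value_len, memstore_ts_len=0):
    """Bytes for one KV; memstore_ts_len is the optional HFile V2 memstoreTS."""
    return (KEY_LENGTH_FIELD + VALUE_LENGTH_FIELD
            + key_size(row_len, cf_len, qualifier_len)
            + value_len + memstore_ts_len)


# Assumed example: 33-byte rowkey, 1-byte CF name, no qualifier, empty value.
example_key = key_size(33, 1, 0)
example_kv = kv_size(33, 1, 0, 0)
example_kv_v2 = kv_size(33, 1, 0, 0, memstore_ts_len=1)
```

Under those assumptions the key part comes to 46 bytes and the whole KV to 54 bytes (55 with a 1-byte memstoreTS), which lines up with the avgKeyLen and bytes/KV figures quoted below.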
From: Sever Fundatureanu [[EMAIL PROTECTED]]
Sent: Tuesday, July 03, 2012 8:36 PM
To: [EMAIL PROTECTED]
Subject: Re: HBase table disk usage
I was only du'ing the table dir. The tmp dirs only had a couple of hundred
bytes in my case.
The HFile tool only gives avgKeyLen=46. This does not include the 4-byte
KeyLength and 4-byte ValueLength fields.
Now indeed I get a total of 54 bytes/KV *1.5 billion ~= 81GB. Probably
there are also leftovers from HDFS blocks not being fully occupied.
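
For reference, the two estimates in this thread can be reproduced with a few lines. This is a rough sketch that assumes decimal gigabytes (10^9 bytes) and ignores memstoreTS, HFile block indexes and HDFS block slack; the helper name is hypothetical.

```python
KV_COUNT = 1_500_000_000  # 1.5 billion KVs, as stated in the thread


def table_size_gb(bytes_per_kv, kv_count=KV_COUNT):
    """Estimated table size in decimal GB (10^9 bytes)."""
    return bytes_per_kv * kv_count / 1e9


# Earlier estimate: 33 + 1 + 8 bytes per entry.
naive_gb = table_size_gb(33 + 1 + 8)

# Revised estimate: avgKeyLen plus the KeyLength and ValueLength fields.
revised_gb = table_size_gb(46 + 4 + 4)

print(naive_gb, revised_gb, revised_gb - naive_gb)
```

The 12 extra bytes per KV account for 18 GB of the gap between the two estimates.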
On Tue, Jul 3, 2012 at 2:29 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 3, 2012 at 2:17 PM, Sever Fundatureanu
> <[EMAIL PROTECTED]> wrote:
> > Right, forgot about the timestamps. These should be a long value each, so 8
> > bytes. The versioning is set to 1 so it shouldn't count.
> > Note the column qualifier is also void on each entry.
> > So now we get (33+1+8)x1.5*10^9 = 63GB, still a 19GB difference...
> What about regionserver WAL logs? You including these in your math or
> are you just du'ing the table dir? The table dir can have tmp dirs
> for compaction and split work. And after Michael Segel, the KV has a
> type byte as well as some lengths for finding offsets in KV; take a
> looksee w/ the hfile tool:
Vrije Universiteit Amsterdam
E-mail: [EMAIL PROTECTED]