Re: Meaning of storefileIndexSize
Renaud Delbru 2010-05-18, 16:45
On 18/05/10 17:31, Stack wrote:
> On Tue, May 18, 2010 at 9:04 AM, Renaud Delbru<[EMAIL PROTECTED]> wrote:
>> We have 90M rows, each row varies from a few hundred kilobytes to
> Index keeps the 'key' that starts each block in an hfile and its
> offset where the 'key' is a combination of row+column+timestamp (not
> the value). Your 'keys' are large?
Our row keys are just plain web document URLs. Column names are a few
characters. So, I would say fairly small.
>> I have also changed another parameter at the same time, the
>> hbase.hregion.max.filesize. It was set to 1GB (from a previous test), and I
>> switched it back to the default value (256MB).
>> So, in the previous tests, there were only a few region files (about
>> 250), but a very large index file size (>500).
>> In my last test (hregion.max.filesize=256, block size=128K), the number of
>> region files increased (I now have more than 1000 region files), but the
>> index file size is now less than 200.
>> Do you think the hregion.max.filesize could have had an impact on the index
>> file size?
> Hmm. You have same amount of "data" just more files because you
> lowered max filesize (by a factor of 4 so 4x the number of files) so
> I'd expect that index would be of the same size.
Ok, so it is just the modification of block size which reduces the index
> If inclined to do more digging, you can use the hfile tool:
> ./bin/hbase org.apache.hadoop.hbase.io.hfile.HFile
> Do the above and you'll get usage. Print out the metadata on hfiles.
> Might help you figure out what's going on.
I'll have a look at this.
>> So, storefileIndexSize indicates the number of MB of heap used by the index.
>> And, in our case, 500 was excessive given the fact that our region
>> server is limited to 1GB of heap
> If 1GB only, then yeah, big indices will cause a prob. How many
> regions per regionserver? Sounds like you have a few? If so, can you
> add more servers? Or up the RAM in your machines?
Yes, we have four nodes, each node currently has about 280 region files.
We are not able to increase the number of nodes or the RAM for the
moment. So, our solution was to tune HBase for our setup. And, in the
end, HBase seems to handle it well.
Using the new configuration settings, I was able to import our 90M rows
in less than 11 hours (using a map-reduce job on the same cluster),
while keeping the used heap of the region servers relatively small (300
Now, the region servers look stable, with a relatively small heap used,
even if I use the HBase table as a map-reduce input format.
So, it seems that the memory problem was related to the hfile block size.
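
For what it is worth, here is a rough back-of-the-envelope sketch of why the block size matters and the max filesize does not. It only assumes what was said above: the index holds one entry (first key plus offset) per block. The total data size, the average key size, and the per-entry overhead are made-up numbers for illustration, not measurements from our cluster, and estimate_index_bytes is a hypothetical helper, not part of HBase.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def estimate_index_bytes(file_sizes, block_size, avg_key_bytes, entry_overhead=12):
    """Rough index size: one entry (first key + offset) per block.

    avg_key_bytes and entry_overhead are assumptions for illustration only.
    """
    # Ceiling division: a partially filled last block still needs an entry.
    blocks = sum(-(-size // block_size) for size in file_sizes)
    return blocks * (avg_key_bytes + entry_overhead)


# Hypothetical layout: the same amount of data stored as a few large
# files or as four times as many smaller files.
few_large = [1 * GB] * 250
many_small = [256 * MB] * 1000

index_large_files = estimate_index_bytes(few_large, 64 * KB, avg_key_bytes=100)
index_small_files = estimate_index_bytes(many_small, 64 * KB, avg_key_bytes=100)
index_big_blocks = estimate_index_bytes(many_small, 128 * KB, avg_key_bytes=100)

print(f"1GB files, 64K blocks:    {index_large_files / MB:.0f} MB")
print(f"256MB files, 64K blocks:  {index_small_files / MB:.0f} MB")
print(f"256MB files, 128K blocks: {index_big_blocks / MB:.0f} MB")
```

Under these assumptions, splitting the same data into more files leaves the estimate unchanged, while doubling the block size halves the number of index entries and so roughly halves the heap taken by the index.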