Hi Anoop, Hi Ram,
Thanks for your replies.
I looked at the code and found in the HFileBlockIndex the midkey
function which is doing the computation used in the
Now, if all the keys are almost equal in size, and the table has only
one big 10GB region, if we lower the maxfilesize parameter to
something like 300MB, we should end up with regions of almost equal size, right?
That's not the result I got, so I'm trying to figure out where I'm wrong.
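
For reference, the midkey rule as Ram describes it in the quoted message (the first key of the middle block) can be modelled in a few lines. This is only a simplified sketch and does not use any HBase classes: `Block`, `midKey`, `bytesBefore`, the 64 KB block size, and the 50 MB row are all assumptions made up for illustration. Under this model, uniform rows split into two equal halves, while a single oversized row leaves one daughter much bigger than the other.

```java
import java.util.ArrayList;
import java.util.List;

class MidKeySplitSketch {
    static final long KB = 1024L;
    static final long MB = 1024L * KB;

    /** One data block of a hypothetical HFile: its first row key and its size in bytes. */
    static final class Block {
        final String firstKey;
        final long sizeBytes;

        Block(String firstKey, long sizeBytes) {
            this.firstKey = firstKey;
            this.sizeBytes = sizeBytes;
        }
    }

    /** Simplified midkey: the first key of the block in the middle of the (sorted) block list. */
    static String midKey(List<Block> blocks) {
        if (blocks.isEmpty()) {
            throw new IllegalArgumentException("no blocks");
        }
        return blocks.get(blocks.size() / 2).firstKey;
    }

    /** Bytes in blocks whose first key sorts before the split key (the "left" daughter). */
    static long bytesBefore(List<Block> blocks, String splitKey) {
        long total = 0;
        for (Block b : blocks) {
            if (b.firstKey.compareTo(splitKey) < 0) {
                total += b.sizeBytes;
            }
        }
        return total;
    }

    static long totalBytes(List<Block> blocks) {
        long total = 0;
        for (Block b : blocks) {
            total += b.sizeBytes;
        }
        return total;
    }

    /** n blocks of 64 KB each, with first keys AA000, AA001, ... */
    static List<Block> uniformBlocks(int n) {
        List<Block> blocks = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            blocks.add(new Block(String.format("AA%03d", i), 64 * KB));
        }
        return blocks;
    }

    /** The same uniform blocks, preceded by one 50 MB block holding a single big row "A". */
    static List<Block> withHugeRow(int n) {
        List<Block> blocks = new ArrayList<>();
        blocks.add(new Block("A", 50 * MB));
        blocks.addAll(uniformBlocks(n));
        return blocks;
    }

    static void report(String label, List<Block> blocks) {
        String key = midKey(blocks);
        long left = bytesBefore(blocks, key);
        long right = totalBytes(blocks) - left;
        System.out.println(label + ": split at " + key
                + ", left=" + left + " bytes, right=" + right + " bytes");
    }

    public static void main(String[] args) {
        report("uniform rows", uniformBlocks(100));
        report("one huge row", withHugeRow(99));
    }
}
```

The split point is chosen by block position, not by bytes, which is why one very large block skews the result in this model.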
One last thing: if I want to change the default behaviour and split
based on the number of rows instead of the midkey, is there somewhere I can hook in?
Or will I have to disable the default split (by setting the
maxfilesize to something like 20GB) and run a job to split the regions
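
As a rough illustration of the row-count idea (a sketch only, not HBase code): given the row keys of a region in sorted order, for example collected by a scan, split keys can be picked at evenly spaced row positions. The class name, `splitKeys`, and the sample keys are hypothetical. Each resulting key would then have to be passed to an explicit split request on that row key, which Ram says in the quoted message is possible.

```java
import java.util.ArrayList;
import java.util.List;

class RowCountSplitSketch {
    /**
     * Picks up to (regions - 1) split keys so that each resulting range
     * holds roughly the same number of rows. Keys must already be sorted.
     */
    static List<String> splitKeys(List<String> sortedRowKeys, int regions) {
        if (regions < 1) {
            throw new IllegalArgumentException("regions must be >= 1");
        }
        List<String> result = new ArrayList<>();
        int n = sortedRowKeys.size();
        for (int i = 1; i < regions; i++) {
            // Position of the first row of the i-th range.
            int idx = (int) ((long) n * i / regions);
            if (idx <= 0 || idx >= n) {
                continue; // too few rows to cut here
            }
            String key = sortedRowKeys.get(idx);
            // Skip duplicates so every split key is distinct.
            if (result.isEmpty() || !result.get(result.size() - 1).equals(key)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            keys.add(String.format("row%02d", i));
        }
        System.out.println(splitKeys(keys, 4));
    }
}
```

Note that this balances row counts, not bytes, so regions would still differ in size if the rows themselves differ a lot in size.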
2013/1/22, ramkrishna vasudevan <[EMAIL PROTECTED]>:
> Hi Jean
> Before replying with what I know: region splits can be configured too.
> Ok, now on how the split happens:
> -> You can explicitly ask the region to be split on a specific row key,
> if you know that splitting on that row key will yield almost equal
> region sizes.
> -> Now when HBase tries to split, it just takes the midkey from the HFiles.
> Here the midkey is the one that is the first key in the mid block of the
> Also, individual rows cannot be split. So if one row is nearly the size
> of the region and the other rows are much smaller, HBase still looks for the
> mid block inside the HFile; the block holding that row is going to be
> huge, and it may end up as a region of its own. I know this has to do with
> the internals of the splitting code.
> On Tue, Jan 22, 2013 at 5:12 PM, Jean-Marc Spaggiari <
> [EMAIL PROTECTED]> wrote:
>> I'm wondering, what is HBase's split policy?
>> I mean, let's imagine this situation.
>> I have a region full of rows starting from AA to AZ. Hundreds of
>> thousands. I also have a few rows from B to DZ. Let's say only one
>> Region is just above the maxfilesize, so it's fine.
>> Now, I add "A" and store a very big row into it. Almost half the size
>> of my maxfilesize value. That means it's now time to split this region.
>> How will HBase decide where to split it? Is it going to use the
>> lexical order? Which means it will split somewhere between B and C? If
>> it's done that way, I will have one VERY small region, and one VERY
>> big one which will still be over the maxfilesize and will need to be split
>> again, and most probably many times, right?
>> Or will HBase take the middle of the region, look at the closest key,
>> and cut there?
>> Yesterday, for one table, I merged all my regions into a single one.
>> This gave me something like a 10GB region. Since I want to have at
>> least 100 regions for this table, I set the maxfilesize to
>> 100MB. I restarted HBase and let it work overnight.
>> This morning, I have some very big regions, still over 100MB, and
>> some very small ones. And the big regions are at least a hundred times
>> bigger than the small ones.
>> I just stopped the cluster again to re-merge the regions into a single
>> one and see if I have not done something wrong in the process, but in
>> the meantime, I'm looking for more information about the way HBase is
>> deciding where to cut, and if there is a way to customize that.
>> PS: The numbers are from memory. I don't really recall how big the last
>> region was yesterday. I will take more notes when the current
>> MassMerge is done.