
HBase, mail # user - HBase split policy


Re: HBase split policy
ramkrishna vasudevan 2013-01-22, 14:02
>>Also, last thing. If I want to change the default behaviour and split
>>based on the row number instead of the midkey, can I hook somewhere?

HTableDescriptor myHtd = new HTableDescriptor();
    myHtd.setValue(HTableDescriptor.SPLIT_POLICY,
        KeyPrefixRegionSplitPolicy.class.getName());
So I suppose the region split policy can be changed only during table
creation.  (I may be wrong; not sure if there is any other way out there.)
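A minimal sketch of applying that descriptor at table creation time, assuming
the 0.94-era HBaseAdmin API; the table name "my_table" and column family "cf"
are placeholders, not from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.regionserver.KeyPrefixRegionSplitPolicy;

    public class CreateTableWithSplitPolicy {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        // Placeholder table name and column family, for illustration only.
        HTableDescriptor htd = new HTableDescriptor("my_table");
        htd.addFamily(new HColumnDescriptor("cf"));
        // The policy is stored in the table descriptor, so it is set here,
        // at creation time, as noted above.
        htd.setValue(HTableDescriptor.SPLIT_POLICY,
            KeyPrefixRegionSplitPolicy.class.getName());
        admin.createTable(htd);
        admin.close();
      }
    }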

When I said split based on a row key, my point was to use
admin.split(rowkey).  I will check more on your calculations and figures
and get back to you.
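A hedged sketch of that call, again with placeholder names; the two-argument
split(tableOrRegionName, splitPoint) form asks the region holding that key to
split at it:

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HBaseAdmin;

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    // "my_table" and "row-05000" are illustrative placeholders.
    admin.split("my_table", "row-05000");
    admin.close();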

Regards
Ram
On Tue, Jan 22, 2013 at 7:17 PM, Jean-Marc Spaggiari <
[EMAIL PROTECTED]> wrote:

> Hi Anoop, Hi Ram,
>
> Thanks for your replies.
>
> I looked at the code and found in HFileBlockIndex the midkey() function
> that does the computation used by the Store.getSplitPoint() method.
>
> Now, if all the keys are almost equal in size, and the table has only
> one big 10GB region, if we lower the maxfilesize parameter to
> something like 300MB, we should see only almost equal regions, right?
> That's not the result I got, so I'm trying to figure out where I'm wrong.
>
> Also, last thing. If I want to change the default behaviour and split
> based on the row number instead of the midkey, can I hook somewhere?
>
> Or will I have to disable the default split (by setting the
> maxfilesize to something like 20GB) and run a job to split the regions
> manually?
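One possible hook, sketched under the assumption of the 0.94-era
RegionSplitPolicy API: subclass a policy and override getSplitPoint().  The
row-count logic itself is only a placeholder, since HBase does not keep
per-region row counts.

    import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;

    public class RowCountSplitPolicy extends ConstantSizeRegionSplitPolicy {
      @Override
      protected byte[] getSplitPoint() {
        // Hypothetical helper: pick a split key from a row-count estimate.
        byte[] rowBasedSplit = estimateMedianRowKey();
        // Fall back to the default midkey-based split point.
        return rowBasedSplit != null ? rowBasedSplit : super.getSplitPoint();
      }

      private byte[] estimateMedianRowKey() {
        // Placeholder: a real implementation would need its own bookkeeping
        // (e.g. an external index or a periodic scan) to find the median row.
        return null;
      }
    }

Such a class would then be registered via HTableDescriptor.SPLIT_POLICY, as in
the snippet at the top of this message.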
>
> Thanks,
>
> JM
>
> 2013/1/22, ramkrishna vasudevan <[EMAIL PROTECTED]>:
> > Hi Jean
> >
> > Before replying with what I know: region splits can be configured too.
> >
> > Ok, now on how the split happens
> > -> You can explicitly ask a region to be split on a specific row key,
> > if you know that splitting on that row key will yield almost equal
> > region sizes.
> > -> Now when HBase tries to split on its own, it just takes the midkey
> > from the HFiles.  Here the midkey is the first key of the middle block
> > of the HFile.
> > Also, individual rows cannot be split.  So if one row is nearly the
> > size of the region and the other rows are much smaller, HBase still
> > looks for the middle block inside the HFile; one of the blocks is going
> > to be very large, and that may be split off as one region.  I know this
> > has to do with the internals of the splitting code.
> >
> >
> > Regards
> > Ram
> >
> > On Tue, Jan 22, 2013 at 5:12 PM, Jean-Marc Spaggiari <
> > [EMAIL PROTECTED]> wrote:
> >
> >> Hi,
> >>
> >> I'm wondering what HBase's split policy is.
> >>
> >> I mean, let's imagine this situation.
> >>
> >> I have a region full of rows with keys from AA to AZ; hundreds of
> >> thousands of them.  I also have a few rows from B to DZ, let's say
> >> only one hundred.
> >>
> >> The region is just under the maxfilesize, so it's fine.
> >>
> >> Now I add "A" and store a very big row into it, almost half the size
> >> of my maxfilesize value.  That means it's now time to split this region.
> >>
> >> How will HBase decide where to split it?  Is it going to use the
> >> lexical order, which means it will split somewhere between B and C?  If
> >> it's done that way, I will have one VERY small region and one VERY big
> >> one which will still be over the maxfilesize and will need to be split
> >> again, most probably many times, right?
> >>
> >> Or will HBase take the middle of the region, look at the closest key,
> >> and cut there?
> >>
> >> Yesterday, for one table, I merged all my regions into a single one.
> >> This gave me something like a 10GB region.  Since I want at least 100
> >> regions for this table, I set the maxfilesize to 100MB, restarted
> >> HBase, and let it work overnight.
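A hedged sketch of one way to apply that per-table limit with the 0.94 client
API (names are placeholders; the cluster-wide default is
hbase.hregion.max.filesize in hbase-site.xml):

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    HBaseAdmin admin = new HBaseAdmin(HBaseConfiguration.create());
    byte[] table = Bytes.toBytes("my_table");
    HTableDescriptor htd = admin.getTableDescriptor(table);
    htd.setMaxFileSize(100L * 1024 * 1024);  // 100 MB
    admin.disableTable(table);               // safest path for a schema change
    admin.modifyTable(table, htd);
    admin.enableTable(table);
    admin.close();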
> >>
> >> This morning I have some very big regions, still over 100MB, and some
> >> very small ones.  The big regions are at least a hundred times bigger
> >> than the small ones.
> >>
> >> I just stopped the cluster again to re-merge the regions into a single