HBase >> mail # user >> default region splitting on which value?


Re: default region splitting on which value?
The answer to your first question is yes - the midkey of the key range would be chosen as the split key.
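
As a rough illustration of why this works out well for hashed keys, the sketch below hashes 10,000 made-up row keys with MD5, sorts the hex strings and picks the middle one. It is only a toy model: the class and method names are hypothetical, and HBase does not sort all keys in memory like this. As far as I know it takes the midkey from the block index of the region's largest store file, so the real split point is approximate.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy model only, not HBase code. All names here are made up for illustration.
final class MidKeyDemo {

    // Hex-encoded MD5 of a row key, zero-padded to 32 characters.
    static String md5Hex(String rowKey) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(rowKey.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Hashes the hypothetical row keys "row-0" .. "row-(count-1)".
    static List<String> hashedKeys(int count) {
        List<String> keys = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            keys.add(md5Hex("row-" + i));
        }
        return keys;
    }

    // Middle element of the sorted keys: an idealized split key.
    static String midKey(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        Collections.sort(sorted);
        return sorted.get(sorted.size() / 2);
    }

    public static void main(String[] args) {
        System.out.println("idealized split key: " + midKey(hashedKeys(10_000)));
    }
}
```

With evenly distributed hashes the printed key should start with 7 or 8, i.e. close to the middle of the 000000... to FFFFF... range, so both daughter regions end up holding about half of the data.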

For #2, can you tell us how you plan to randomize the loading?
Bulk load normally means preparing HFiles which would be loaded directly into your table.
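
If you stay with client-side puts, one possible way to randomize (just a sketch, and an assumption on my side about what you have in mind) is to build the zero-padded row keys and shuffle them before writing. The class, method names and key width below are hypothetical. Zero-padding keeps the lexicographic order equal to the numeric order, so sequential scans still work after the load.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch only: decides the order of writes, it does not talk to HBase.
final class ShuffledLoadOrder {

    // Zero-pad so that byte-wise key order matches numeric order.
    static String rowKey(long id, int width) {
        return String.format("%0" + width + "d", id);
    }

    // Row keys for ids firstId .. firstId+count-1, in a pseudo-random order.
    static List<String> shuffledKeys(long firstId, int count, int width, long seed) {
        List<String> keys = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            keys.add(rowKey(firstId + i, width));
        }
        // A fixed seed makes the load order reproducible between runs.
        Collections.shuffle(keys, new Random(seed));
        return keys;
    }

    public static void main(String[] args) {
        List<String> keys = shuffledKeys(0, 1000, 10, 42L);
        // In a real loader each key would become a Put, sent in batches.
        System.out.println("first keys to write: " + keys.subList(0, 5));
    }
}
```

On a pre-split table, consecutive puts in such a list land in different regions, which is what spreads the write load. Shuffling a full list only works while the keys fit in memory; for larger inputs you would have to shuffle chunk by chunk.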

Cheers

On Apr 20, 2013, at 1:11 PM, Pal Konyves <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> Only one family, my data is very simple key-value. I want to do
> sequential scans, though, so making a hash of the key is not an option.
>
>
>
> On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> How many column families do you have?
>>
>> For #3, pre-splitting the table at the row keys corresponding to the peaks
>> makes sense.
>>
>> On Apr 20, 2013, at 10:52 AM, Pal Konyves <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I am just reading about region splitting. By default - as I understand -
>>> HBase handles splitting the regions. I just can't picture which key it
>>> splits a region on.
>>>
>>> 1) For example, when I write MD5 hashes of the rowkeys, they are most
>>> probably evenly distributed from 000000... to FFFFF..., right? When HBase
>>> starts with one region, all the writes go into that region, and when the
>>> HFile gets too big, it just takes, for example, the median value of the
>>> stored keys and splits the region at it?
>>>
>>> 2) I want to bulk load tons of data with the HBase Java client API put
>>> operations. I want it to perform well. My keys are numeric sequential
>>> values (which, as I know from this post, I cannot load into HBase
>>> sequentially, because the HBase tables are going to be sad:
>>> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
>>> ).
>>> So I thought I would pre-split the table into regions and load the data
>>> in randomized order. This way I will get good distribution among region
>>> servers in terms of network IO from the beginning. Is that a good idea?
>>>
>>> 3) What if my rowkeys are not evenly distributed in the keyspace, but
>>> show some peaks or bursts? E.g. the range is 000-999, but most of the keys
>>> gather around the 020 and 060 values. Is it a good idea to put the
>>> pre-split points at those peaks?
>>>
>>> Thanks in advance,
>>> Pal
>>
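
On #3, one possible way to find split points that follow the peaks is to take them from a sample of the real keys at equal-count intervals (quantiles), so that dense key ranges get more boundaries. The sketch below is only an illustration under that assumption; the class and method names are hypothetical and it does not call HBase. The resulting strings could then be converted to byte arrays and handed to the split-keys variant of createTable in the admin API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch only: derives pre-split boundaries from a sample of row keys.
final class QuantileSplits {

    // Returns up to numRegions - 1 split keys, chosen so that each region
    // would receive roughly the same number of sampled keys.
    static List<String> splitKeys(List<String> sample, int numRegions) {
        if (numRegions < 2 || sample.size() < numRegions) {
            throw new IllegalArgumentException("need numRegions >= 2 and a sample at least that large");
        }
        List<String> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);

        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int index = (int) ((long) i * sorted.size() / numRegions);
            String candidate = sorted.get(index);
            // Split keys must be strictly increasing, so drop repeats
            // caused by one very frequent key.
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(candidate)) {
                splits.add(candidate);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        List<String> sample = new ArrayList<>();
        // Made-up sample: a uniform background plus bursts near 020 and 060.
        for (int i = 0; i < 1000; i += 50) sample.add(String.format("%03d", i));
        for (int i = 15; i < 25; i++) sample.add(String.format("%03d", i));
        for (int i = 55; i < 65; i++) sample.add(String.format("%03d", i));
        System.out.println("split keys: " + splitKeys(sample, 4));
    }
}
```

The quality of the boundaries depends entirely on how representative the sample is, so this is worth re-checking once real data has been loaded.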