HBase >> mail # user >> default region splitting on which value?


Re: default region splitting on which value?
The answer to your first question is yes: the midkey of the region's key range (in practice, the midkey of its largest store file) would be chosen as the split key.
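
To make the midkey idea concrete, here is a minimal sketch in plain Java. It is a simplified model, not HBase's actual code: it assumes the region's row keys are available as one sorted list and takes the key in the middle position. The class and method names (MidkeySplit, midkey, split) are made up for illustration. The sample keys are clustered at the low end to show that the split follows the stored keys, not the midpoint of the keyspace.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MidkeySplit {
    /** Returns the key in the middle position of an already sorted key list. */
    static String midkey(List<String> sortedKeys) {
        return sortedKeys.get(sortedKeys.size() / 2);
    }

    /** Splits the keys into two halves: [first, splitKey) and [splitKey, last]. */
    static List<List<String>> split(List<String> sortedKeys) {
        String splitKey = midkey(sortedKeys);
        List<String> left = new ArrayList<>();
        List<String> right = new ArrayList<>();
        for (String key : sortedKeys) {
            if (key.compareTo(splitKey) < 0) {
                left.add(key);
            } else {
                right.add(key);
            }
        }
        return Arrays.asList(left, right);
    }

    public static void main(String[] args) {
        // Hypothetical keys in a 000-999 keyspace, clustered around 020.
        List<String> keys = Arrays.asList("010", "018", "020", "021", "022", "060", "900");
        System.out.println("split key: " + midkey(keys));
        System.out.println("daughters: " + split(keys));
    }
}
```

With these keys the split lands at 021 rather than at 500, so both daughter regions hold a similar number of rows.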

For #2, can you tell us how you plan to randomize the loading?
Bulk load normally means preparing HFiles which would be loaded directly into your table.
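
For the put-based approach described in #2, a rough sketch of the two pieces (pre-split boundaries and a randomized load order) could look like the following. It is plain Java with no HBase dependency, and it assumes numeric keys in [0, maxKey) written as zero-padded strings, since HBase compares row keys as bytes and unpadded numbers would not sort numerically. PreSplitSketch, splitKeys and shuffledKeys are hypothetical names. The resulting boundaries would still have to be converted to byte[][] and passed to table creation, which is not shown here.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PreSplitSketch {
    /** Evenly spaced split keys for zero-padded numeric row keys in [0, maxKey). */
    static List<String> splitKeys(long maxKey, int regions, int width) {
        List<String> keys = new ArrayList<>();
        // N regions need N - 1 boundaries.
        for (int i = 1; i < regions; i++) {
            long boundary = maxKey * i / regions;
            keys.add(String.format("%0" + width + "d", boundary));
        }
        return keys;
    }

    /** All keys in [0, maxKey) in shuffled order, so consecutive puts spread across regions. */
    static List<String> shuffledKeys(long maxKey, int width, long seed) {
        List<String> keys = new ArrayList<>();
        for (long k = 0; k < maxKey; k++) {
            keys.add(String.format("%0" + width + "d", k));
        }
        Collections.shuffle(keys, new Random(seed));
        return keys;
    }

    public static void main(String[] args) {
        System.out.println("split keys: " + splitKeys(1000, 4, 3));
        System.out.println("first keys to load: " + shuffledKeys(1000, 3, 42L).subList(0, 5));
    }
}
```

Shuffling the whole key list in memory only works for modest data sizes; for larger loads the same effect would need shuffling per batch or per input file.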

Cheers

On Apr 20, 2013, at 1:11 PM, Pal Konyves <[EMAIL PROTECTED]> wrote:

> Hi Ted,
> Only one family, my data is very simple key-value, although I want to make
> sequential scan, so making a hash of the key is not an option.
>
>
>
> On Sat, Apr 20, 2013 at 10:07 PM, Ted Yu <[EMAIL PROTECTED]> wrote:
>
>> How many column families do you have?
>>
>> For #3, pre-splitting the table at the row keys corresponding to the peaks
>> makes sense.
>>
>> On Apr 20, 2013, at 10:52 AM, Pal Konyves <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I am just reading about region splitting. By default, as I understand it,
>>> HBase handles splitting the regions. I just can't picture which key it
>>> splits the regions on.
>>>
>>> 1) For example, when I write MD5 hashes of row keys, they are most probably
>>> evenly distributed from
>>> 000000... to FFFFF..., right? When HBase starts with one region, all the
>>> writes go into that region, and when the HFile gets too big, it just
>>> takes, for example, the median value of the stored keys and splits the
>>> region by this?
>>>
>>> 2) I want to bulk load tons of data with the HBase Java client API put
>>> operations. I want it to perform well. My keys are numeric sequential
>>> values (which, as I know from this post, I cannot load into HBase
>>> sequentially, because the HBase tables are going to be sad:
>>> http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
>>> )
>>> So I thought I would pre-split the table into regions, and load the data
>>> randomized. This way I will get good distribution among region servers in
>>> terms of network IO from the beginning. Is that a good idea?
>>>
>>> 3) What if my row keys are not evenly distributed in the keyspace, but
>>> show some peaks or bursts? E.g. 000-999, but most of the keys gather
>>> around the 020 and 060 values. Is it a good idea to put the region
>>> pre-splits at those peaks?
>>>
>>> Thanks in advance,
>>> Pal
>>