HBase >> mail # user >> default region splitting on which value?


default region splitting on which value?
Hi,

I am just reading about region splitting. By default, as I understand it,
HBase handles splitting the regions itself. I just can't picture which key
it splits a region on.

1) For example, when I use the MD5 hash of my rowkeys, they are most probably
evenly distributed from
000000... to FFFFF..., right? When HBase starts with one region, all the
writes go into that region, and when the HFile gets too big, does it just
take, for example, the median value of the stored keys and split the region
at that key?

2) I want to bulk load tons of data with put operations from the HBase Java
client API, and I want it to perform well. My keys are numeric sequential
values (which, as I know from this post, I should not load into HBase
sequentially, because all the writes would land in one region and the HBase
tables are going to be sad:
http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/
 )
So I thought I would pre-split the table into regions and load the data
in randomized order. This way I would get good distribution among region
servers in terms of network IO from the beginning. Is that a good idea?

3) What if my rowkeys are not evenly distributed in the keyspace, but show
some peaks or bursts? E.g. the keys range over 000-999, but most of them
cluster around 020 and 060. Is it a good idea to put the pre-split points
at those peaks?
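
To make 2) and 3) concrete, here is a rough sketch of how I imagine computing
the split keys. This is only my assumption of how it could be done, not
something taken from the HBase docs. The class SplitPointSketch and its
methods md5Hex, uniformSplits and quantileSplits are names I made up for
illustration, and the code uses nothing beyond the standard Java library.
uniformSplits assumes fixed-width lowercase hex MD5 keys; quantileSplits
assumes I have a representative sample of keys, and it places boundaries so
that each region gets roughly the same number of rows, which means more
boundaries end up near the peaks.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, for illustration only; not part of HBase.
final class SplitPointSketch {

    // Hash a sequential numeric id into a 32-character lowercase hex row key.
    static String md5Hex(long id) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(Long.toString(id).getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // Evenly spaced boundaries over the 32-hex-digit key space.
    // Returns numRegions - 1 split keys, in ascending order.
    static List<String> uniformSplits(int numRegions) {
        BigInteger keySpace = BigInteger.ONE.shiftLeft(128); // 16^32 possible keys
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            BigInteger boundary = keySpace
                    .multiply(BigInteger.valueOf(i))
                    .divide(BigInteger.valueOf(numRegions));
            splits.add(String.format("%032x", boundary));
        }
        return splits;
    }

    // Boundaries taken from a key sample at equal-count quantiles, so dense
    // ranges (the "peaks") are cut into more regions than sparse ranges.
    // Duplicate boundaries are skipped, so fewer keys may be returned.
    static List<String> quantileSplits(List<String> sampleKeys, int numRegions) {
        List<String> sorted = new ArrayList<>(sampleKeys);
        Collections.sort(sorted);
        List<String> splits = new ArrayList<>();
        for (int i = 1; i < numRegions; i++) {
            int index = (int) ((long) i * sorted.size() / numRegions);
            String key = sorted.get(index);
            if (splits.isEmpty() || !splits.get(splits.size() - 1).equals(key)) {
                splits.add(key);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        System.out.println("row key for id 1: " + md5Hex(1L));
        System.out.println("uniform splits:   " + uniformSplits(4));
        List<String> sample = Arrays.asList(
                "020", "021", "022", "060", "061", "062", "500", "900");
        System.out.println("quantile splits:  " + quantileSplits(sample, 4));
    }
}
```

If I read the client API right, the resulting keys (converted to byte[][])
could be handed to the createTable overload that accepts split keys, but I
have not verified that here.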

Thanks in advance,
Pal
Ted Yu 2013-04-20, 20:07
Pal Konyves 2013-04-20, 20:11
Ted Yu 2013-04-20, 20:54
Pal Konyves 2013-04-20, 21:24
Ted Yu 2013-04-21, 01:34
Pal Konyves 2013-04-21, 11:21