HBase user mailing list: HBase split policy


Jean-Marc Spaggiari 2013-01-22, 11:42
RE: HBase split policy
Jean, good topic.
When a region splits, it is really the HFile(s) that get split. An HFile is logically divided into "n" HFileBlocks, and index metadata for these blocks is kept at the HFile level. HBase finds the midkey from this block index data and takes the middle block as the split point.
So it all depends on how the data is spread across the different HFileBlocks. When a region [a,e) is split, it need not be split at point "c"; it all depends on how much data you have for each rowkey pattern.
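
For illustration, a simplified sketch of that idea (not the actual HBase code; the BlockIndexEntry type and the keys/sizes below are invented):

import java.util.Arrays;
import java.util.List;

// Simplified illustration of the "mid block" idea: the split point is the
// first key of the block sitting in the middle of the block index, not the
// lexical middle of the region's key range.
public class MidBlockSplitSketch {

    // Stand-in for one HFile block-index entry: first key + block size.
    static class BlockIndexEntry {
        final String firstKey;
        final long blockSize;
        BlockIndexEntry(String firstKey, long blockSize) {
            this.firstKey = firstKey;
            this.blockSize = blockSize;
        }
    }

    // Pick the first key of the middle block as the split point.
    static String pickSplitPoint(List<BlockIndexEntry> blockIndex) {
        return blockIndex.get(blockIndex.size() / 2).firstKey;
    }

    public static void main(String[] args) {
        // Region key range is [A, E), but almost all blocks hold "A..." rows,
        // so the middle block (and hence the split point) is still in the A's.
        List<BlockIndexEntry> index = Arrays.asList(
            new BlockIndexEntry("AA", 64 * 1024),
            new BlockIndexEntry("AF", 64 * 1024),
            new BlockIndexEntry("AK", 64 * 1024),
            new BlockIndexEntry("AQ", 64 * 1024),
            new BlockIndexEntry("B",  64 * 1024));
        System.out.println("Split point: " + pickSplitPoint(index)); // prints AK, not C
    }
}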

One more thing to remember: sometimes there can be really big HFileBlocks. Even though the default block size is 64 KB, a block can sometimes be much larger than that, because one row cannot be split across two or more blocks; it has to fit in a single block. So it can happen that, when a split occurs, the bigger blocks go to one daughter, leaving that region still big! (When one row is really huge compared to the others.)
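
To put rough numbers on it, a tiny sketch with invented block sizes (one oversized block standing in for a huge row):

public class UnbalancedSplitSketch {
    public static void main(String[] args) {
        // Invented block sizes in KB: one huge row makes one block about 5 MB.
        long[] blockSizes = {64, 64, 64, 5000, 64, 64};
        int mid = blockSizes.length / 2;          // split at the middle block
        long left = 0, right = 0;
        for (int i = 0; i < blockSizes.length; i++) {
            if (i < mid) left += blockSizes[i]; else right += blockSizes[i];
        }
        // left = 192 KB, right = 5128 KB: one daughter region stays much bigger.
        System.out.println("left=" + left + " KB, right=" + right + " KB");
    }
}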

Some thoughts on the topic, as per my limited knowledge of the code...
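
On the "can I customize it" part of the question: since 0.94 the split policy is pluggable, so something along these lines is possible. A rough sketch only (class name and prefix length invented; check the RegionSplitPolicy API of your HBase version), truncating the chosen split point to a key prefix, similar in spirit to KeyPrefixRegionSplitPolicy:

import java.util.Arrays;
import org.apache.hadoop.hbase.regionserver.ConstantSizeRegionSplitPolicy;

// Illustrative only: take the split point computed from the store files and
// truncate it to a fixed-length prefix, so rows sharing that prefix end up
// in the same daughter region.
public class PrefixSplitPolicySketch extends ConstantSizeRegionSplitPolicy {

    private static final int PREFIX_LENGTH = 2;  // invented value

    @Override
    protected byte[] getSplitPoint() {
        byte[] splitPoint = super.getSplitPoint();  // the "mid block" key
        if (splitPoint != null && splitPoint.length > PREFIX_LENGTH) {
            return Arrays.copyOf(splitPoint, PREFIX_LENGTH);
        }
        return splitPoint;
    }
}

If I remember the API correctly, such a policy can be set per table via HTableDescriptor.setValue(HTableDescriptor.SPLIT_POLICY, className) or cluster-wide via hbase.regionserver.region.split.policy.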

-Anoop-
________________________________________
From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
Sent: Tuesday, January 22, 2013 5:12 PM
To: user
Subject: HBase split policy

Hi,

I'm wondering what the HBase split policy is.

I mean, let's imagine this situation.

I have a region full of rows starting from AA to AZ, hundreds of
thousands of them. I also have a few rows from B to DZ, let's say only
one hundred.

The region is just around the maxfilesize, so it's fine.

Now, I add "A" and store a very big row into it, almost half the size
of my maxfilesize value. That means it's now time to split this region.

How will HBase decide where to split it? Is it going to use the
lexical order, which means it will split somewhere between B and C? If
it's done that way, I will have one VERY small region and one VERY
big one which will still be over the maxfilesize and will need to be
split again, and most probably many times, right?

Or will HBase take the middle of the region, look at the closest key,
and cut there?

Yesterday, for one table, I merged all my regions into a single one.
This gave me something like a 10GB region. Since I want to have at
least 100 regions for this table, I have set the maxfilesize to
100MB. I restarted HBase and let it work overnight.
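
(As an aside: the max file size can also be set per table instead of cluster-wide. A rough sketch against the 0.94-era Java client API, with an invented table name; newer versions use a different Admin API, so treat it as an assumption rather than a recipe:)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: lower MAX_FILESIZE for one table only (100 MB here) instead of
// changing hbase.hregion.max.filesize cluster-wide and restarting HBase.
public class SetMaxFileSizeSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        byte[] table = Bytes.toBytes("mytable");    // invented table name

        HTableDescriptor htd = admin.getTableDescriptor(table);
        htd.setMaxFileSize(100L * 1024 * 1024);     // 100 MB per region

        admin.disableTable(table);                  // may not be needed on newer versions
        admin.modifyTable(table, htd);
        admin.enableTable(table);
        admin.close();
    }
}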

This morning, I have some very big regions, still over the 100MB, and
some very small ones. And the big regions are at least a hundred times
bigger than the small ones.

I just stopped the cluster again to re-merge the regions into a single
one and see if I did something wrong in the process, but in the
meantime, I'm looking for more information about the way HBase decides
where to cut, and whether there is a way to customize that.

Thanks,

JM

PS: The numbers are from memory; I don't really recall how big the
last region was yesterday. I will take more notes when the current
MassMerge is done.
Other messages in this thread:
ramkrishna vasudevan 2013-01-22, 13:38
Jean-Marc Spaggiari 2013-01-22, 13:47
ramkrishna vasudevan 2013-01-22, 14:02
Jean-Marc Spaggiari 2013-01-22, 14:10
Jean-Marc Spaggiari 2013-01-23, 02:39
Anoop Sam John 2013-01-23, 06:17
Jean-Marc Spaggiari 2013-01-23, 12:26
ramkrishna vasudevan 2013-01-23, 18:09
Jean-Marc Spaggiari 2013-01-23, 18:24