>According to my understanding, the way that HBase works is that on a
>brand new system, all keys will start going to a single region i.e. a
>single region server. Once that region
>reaches a max region size, it will split and then move to another
>region server, and so on and so forth.
Basically, the default table create is 1 region per table that can only go
to 1 RS. Splits happen on that region when it gets large enough, but
balancing the new region to another server is an asynchronous event and
doesn't happen immediately after the first split because of
"hbase.regions.slop". The idea is to create the table with R regions
across S servers so each server has R/S regions and puts will be roughly
uniformly distributed across the R regions, keeping every server equally
busy. Sounds like you have a good handle on this behavior.
>My strategy is to take the incoming data, calculate the hash and then
>mod the hash with the number of machines I have. I will split the
>regions according to the prefix # .
>This should , I think provide for better data distribution when the
>cluster first starts up with one region / region server.
The problem with this strategy: so say you split into 256 regions. Region
splits are basically memcmp() ranges, so they would look like this:
Key prefix Region
0x00 - 0x01 1
0x01 - 0x02 2
0x02 - 0x03 3
Let's say you prefix your key with the machine ID#. You are probably
using a UINT32 for the machine ID, but let's assume that your using a
UINT8 for this discussion. Your M machine IDs would map to exactly 1
region each. So only M out of 256 regions would contain any data. Every
time you add a new machine, all the puts will only go to one region. By
prefixing your key with a hash (MD5, SHA1, whatevs), you'll get random
distribution on your prefix and will populate all 256 regions evenly.
(FYI: If you were using a monotonically increasing UINT32, this would be
worse because they'd probably be low numbers and all map to the 0x00
>These regions should then grow fairly uniformly. Once they reach a
>size like ~ 5G, I can do a rolling split.
I think you want to do rolling splits very infrequently. We have PB of
data and only rolling split twice. The more regions / server, the faster
server recovery happens because there's more parallelism for distributed
log splitting. If you have too many regions / cluster, you can overwork
the load balancer on the master and increase startup time. We're running
at 20 regions/server * 95 regionservers ~= 1900 regions on the master.
How many servers do you have? If you have M servers, I'd try to split
into M*(M-1) regions but keep that value lower than 1000 if possible.
It's also nice to split on log2 boundaries for readability, but
technically the optimal scenario is to have 1 server die and have every
other server split & add exactly the same amount of its regions. M*(M-1)
would give every other server 1 of the regions.
>Also, I want to make sure my regions do not grow too much in size that
>when I end up adding more machines, it does not take a very long time
>to perform a rolling split.
Meh. This happens. This is also what rolling split is designed/optimized
for. It persists the split plan to HDFS so you can stop it and restart
later. It also round robin's the splits across servers & prioritizes
splitting low-loaded servers to minimize region-based load balancing
during the process.
>What I do not understand is the advantages/disasvantages of having
>regions that are too big vs regions that are too thin. What does this
>impact ? Compaction time ? Split time ? What is the
>concern when it comes to how the architecture works. I think if I
>understand this better, I can manage my regions more efficiently.
Are you IO-bound or CPU-bound? The default HBase setup is geared towards
the CPU-bound use case and you should have acceptable performance out of
the box after pre-splitting and a couple well-documented configs (ulimit
comes to mind). In our case, we have medium sized data and are fronted by
an application cache, so we're IO-bound. Because of this, we need to
tweak the compaction algorithm to minimize IO writes at the expense of
StoreFiles and use optional features like BloomFilters & the TimeRange
Filter to minimize the number of StoreFiles we need to query on a get.