Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - Region number and allocation advice


Copy link to this message
-
Re: Region number and allocation advice
Jack Levin 2012-07-06, 21:27
Please note that if you have fast growing datastore, you may end up
with very large region files - if you limit the number of regions.  If
that happens (and you can tell by simply examining your HDFS), your
compactions (which you can't avoid) will end up rewriting a lot of
data.  In our case (we have 600TB hbase cluster for storing photos),
we keep our region size at about 10-15GB and no more. Anything else
will overdrive your disk IO during compactions.

On Fri, Jul 6, 2012 at 2:22 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> Hey Nick,
>
> I'd say it has nothing to do with spindles/CPUs. The reason is that
> locking happens only at the row level thus having 1 or 100 regions
> doesn't change anything, then you only have 1 HLog to write to
> (currently at least) so spindles don't really count either. It might
> influence the number of threads that can compact tho.
>
> I gave some explanations at the HBase BoF we did after the Hadoop
> Summit[1] and, from the writes POV, you want to have as much MemStore
> potential as you have dedicated heap for it and the HLog need to match
> that too. One region can make sense for that except for the issues I
> demonstrated and also it's harder to balance. Instead, having a few
> regions will give you more agility. For the sake of giving a number
> I'd say 10-20 but then you need to tune your memstore size
> accordingly. This really is over-optimization tho and it's not even
> something we put in practice here since we have so many different use
> cases.
>
> The cluster size in my opinion doesn't influence the number of regions
> you should have per machine, as long as you keep it low it will be
> fine.
>
> Hope this helps,
>
> J-D
>
> 1. http://www.slideshare.net/jdcryans/performance-bof12
>
> On Fri, Jul 6, 2012 at 1:53 PM, Nick Dimiduk <[EMAIL PROTECTED]> wrote:
>> Heya,
>>
>> I'm looking for more detailed advice about how many regions a table should
>> run. Disabling automatic splits (often hand-in-hand with disabling
>> automatic compactions) is often described as advanced practice, at least
>> when guaranteeing latency SLAs. Which begs the question: how many regions
>> should I have? Surely this depends on both the shape of your data and
>> expected workload. I've seen "10-20 Regions per RS" thrown around as a
>> stock answer. My question is: why? Presumably that's 10-20 regions per RS
>> for all tables rather than per-table. That advice is centered around a
>> regular region size, but surely distribution of ops/load matters more. But
>> still, where does 10-20 come from? Is it a calculation vs the number of
>> cores on the RS, like advice given around parallelizing builds? If so, how
>> many cores are we assuming the RS has? Is it a calculation vs the amount of
>> RAM available? Is 20 regions based on a trade-off between static
>> allocations and per-region memory overhead? Does 10-20 become 5-15 in a
>> memory-restricted environment and bump to 20-40 when more RAM is available?
>> Does it have to do with the number of spindles available on the machine?
>> Threads like this one [0] give some hint about how the big players work.
>> However, that advice looks heavily influenced by concerns when there are
>> 1000's of regions to manage. How does advice for larger clusters (>50
>> nodes) differ from smaller clusters (<20 nodes)?
>>
>> Thanks,
>> -n
>>
>> [0]: http://thread.gmane.org/gmane.comp.java.hadoop.hbase.user/22451