Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> region size/count per regionserver


Copy link to this message
-
Re: region size/count per regionserver
Simple answer
-------------
20 regions/server & <2000 regions/cluster is a good rule of thumb if you
can't profile your workload yet.  You really want to ensure that

1) You need to limits the regions/cluster so the master can have a
reasonable startup time & can handle all the region state transitions via
ZK.  Most bigger companies are running 2,000 in production and achieve
reasonable startup times (< 2 minutes for region assignment on cold
start).  If you want to test the scalability of that algorithm beyond what
other companies need, admin beware.
2) The more regions/server you have, the faster that recovery can happen
after RS death because you can currently parallelize recovery on a
region-granularity.  Too many regions/server and #1 starts to be a problem.

Complicated answer
------------------
More information is optimize this formula.  Additional considerations:

1) Are you IO-bound or CPU-bound
2) What is your grid topology like
3) What is your network hardware like
4) How many disks (not just size)
5) What is the data locality between RegionServer & DataNode

In the Facebook case, we have 5 racks with 20 nodes each.  Servers in the
rack are connected by 1G Eth to a switch with a 10G uplink.  We are
network bound.  Our saturation point is mostly commonly on the top-of-rack
switch.  With 20 regions/server, we can roughly parallelize our
distributed log splitting within a single rack on RS death (although 2
regions do split off-rack).  This minimizes top-of-rack traffic and
optimized our recovery time.  Even if you are CPU-bound, log splitting
(hence recovery time) is an IO-bound operation.  A lot of our work on
region assignment is about maximizing data locality, even on RS death, so
we avoid top-of-rack saturation.
On 11/1/11 10:54 AM, "Sujee Maniyam" <[EMAIL PROTECTED]> wrote:

>HI all,
>My HBase cluster is 10 nodes, each node has 12core ,   48G RAM, 24TB disk,
>10GEthernet.
>My region size is 1GB.
>
>Any guidelines on how many regions can a RS  handle comfortably?
>I vaguely remember reading some where to have no more than 1000 regions /
>server; that comes to 1TB / server.  Seems pretty low for the current
>hardware config.
>
>Any rules of thumb?  experiences?
>
>thanks
>Sujee
>
>http://sujee.net
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB