A basic rule of thumb is 1 core and 1 GB RAM per JVM. The Hadoop
and HBase daemons will all need such an allocation. You can
extend this to the mapreduce subsystem when considering how many
mappers and/or reducers can concurrently execute on each node
alongside the rest of what you are running. Or, you can choose to
partition your hardware, as some do, so that HDFS and HBase run
separately from the mapreduce task runners, which changes the math
accordingly. Lots of people try to run all-in-one clusters, where
all functions are more or less co-located on every node.
speaking, how much heap a TaskTracker map or reduce task child
will require depends on the user application. But it still loads
the CPU, so I use the 1 CPU/1 GB rule of thumb even for these.
Overload your CPU resources and the JVM scheduler will starve
threads, introducing spurious heartbeat misses, timeouts, and
recovery behaviors in system daemons that will unnecessarily
degrade performance and operation. One thing I
have considered but not tried is using Linux CPU affinity masks
to put system functions in one partition and all user mapreduce
tasks in the other. Another option, as I mentioned, is to split
hardware resources among the functions.
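As a sketch of the untested affinity-mask idea above: on Linux, taskset can pin JVMs to disjoint core sets. The core ranges and daemon commands below are illustrative assumptions for an 8-core node, not a tested recipe:

```shell
# Untested sketch: partition an 8-core node so system daemons share
# cores 0-1 and mapreduce task children get cores 2-7.
SYSTEM_CORES="0-1"   # DataNode, HRegionServer, TaskTracker JVMs
TASK_CORES="2-7"     # user map/reduce task children

# Launching a daemon under an affinity mask (illustrative only):
#   taskset -c $SYSTEM_CORES bin/hadoop-daemon.sh start datanode
# Re-pinning an already-running JVM by pid:
#   taskset -pc $TASK_CORES <task-child-pid>
echo "system daemons -> cores $SYSTEM_CORES, task children -> cores $TASK_CORES"
```

Note the task children would need to be re-pinned as they are spawned, which is part of why I have not tried this.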
Here is what I have used in the past in a successful all-in-one
deployment. In parentheses next to each Java process's name is the
heap allocation (in MB) reserved with -Xmx.
1 node: NameNode (2000) and DataNode (1000)
1 node: HMaster (1000), JobTracker (1000), and DataNode (1000)
23 nodes: DataNode (1000), HRegionServer (2000), and TaskTracker
    (1000), with the concurrency limits for mappers and reducers
    set to 4 and 4, respectively.
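For reference, heap sizes like the above go in the daemon environment files. A minimal sketch, assuming 0.19-era hadoop-env.sh / hbase-env.sh conventions (variable names may differ in your version):

```shell
# hadoop-env.sh (sketch): 1000 MB default heap for DataNode/TaskTracker,
# with the NameNode bumped to 2000 MB via its per-daemon opts.
export HADOOP_HEAPSIZE=1000
export HADOOP_NAMENODE_OPTS="-Xmx2000m $HADOOP_NAMENODE_OPTS"

# hbase-env.sh (sketch): 2000 MB for the region server JVMs.
export HBASE_HEAPSIZE=2000
echo "NameNode opts: $HADOOP_NAMENODE_OPTS"
```

The 4/4 mapper/reducer limits would be set in hadoop-site.xml via mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.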
We picked a midpoint between cheap hardware and big iron. Our per-
node spec was dual quad-core CPUs, 4/8 GB RAM, and 6x 1TB disks.
2x 1TB hosted the system volume in a RAID-1 configuration. The
remaining 4x 1TB drives were attached as JBOD and used as DataNode
data volumes. The rationale for using so much disk per node was to
maximize cluster/rack density.
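The JBOD data volumes would be listed comma-separated in hadoop-site.xml. A sketch, assuming hypothetical /mnt/diskN mount points (not from the setup described above):

```xml
<!-- hadoop-site.xml (sketch): one DataNode data directory per JBOD
     drive. The /mnt/diskN mount points are assumed for illustration. -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/disk1/dfs/data,/mnt/disk2/dfs/data,/mnt/disk3/dfs/data,/mnt/disk4/dfs/data</value>
</property>
```

The DataNode round-robins block writes across these directories, which is why JBOD is preferred over RAID for the data volumes.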
As the size of your HDFS volume increases, you'll need to grow
the heap allocation of your NameNode accordingly. In all my time
running HBase I never needed more than 2 GB allocated to the
NameNode, but I hear that Facebook runs a NameNode with a 20 GB heap.
A word of warning, however: currently HBase is a very challenging
user of HDFS. In 0.20 there are some changes (HFile) which somewhat
lessen the number of open files and should also lower the total
number of DataNode xceivers necessary to support operations.
However, on my 25-node cluster running Hadoop/HBase 0.19, I found
it necessary to increase the DataNode xceiver limit to 4096 (from
its default of 512!) to successfully bootstrap an HBase cluster
with > 7000 regions. Therefore it may not be the
per-node spec that is the determining factor for the stability of
your cluster, but rather the number of DataNodes employed to
sufficiently spread the load.
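The limit in question is the dfs.datanode.max.xcievers property in hadoop-site.xml (note the historical misspelling in the property name itself). A sketch of the bump described above:

```xml
<!-- hadoop-site.xml (sketch): raise the per-DataNode transceiver
     thread ceiling so region servers can keep many store files open. -->
<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value>
</property>
```

This must be set on every DataNode and requires a DataNode restart to take effect.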
Hope that helps,
> From: Amandeep Khurana <[EMAIL PROTECTED]>
> Subject: Typical hardware configurations
> To: [EMAIL PROTECTED], [EMAIL PROTECTED]
> Date: Friday, March 27, 2009, 10:07 PM
> What are the typical hardware config for a node that people
> are using for Hadoop and HBase? I am setting up a new 10 node
> cluster which will have HBase running as well that will be
> feeding my front end directly. Currently, I had a 3 node
> cluster with 2 GB of RAM on the slaves and 4 GB of RAM on the
> master. This didn't work very well due to the RAM being a
> little low.
> I got some config details from the powered by page on the
> Hadoop wiki, but nothing like that for HBase.
> Amandeep Khurana
> Computer Science Graduate Student
> University of California, Santa Cruz