We are running a 9-node Hadoop/HBase cluster on EC2. We expect the number
of nodes to grow several-fold in a couple of months.
Of these 9 instances, three are m1.medium and act as the Hadoop NameNode,
the HBase master and the secondary NameNode. The remaining 6 nodes are all
m1.large instances, where we run datanodes, tasktrackers and the region
servers. The large instances have 7.5 GB of total memory.
There are a couple of issues that we are facing, and we feel that this
architecture is not going to scale for us.
1) The uptime. We have been struggling to keep the cluster up and running.
Sometimes the region servers fail due to overload. Sometimes a GC pause
forces ZooKeeper to declare them dead. Sometimes the region servers get
killed due to OutOfMemory.
Running a MapReduce job always brings at least 2-3 region servers down.
2) We have 4 GB of memory allocated to each region server, and we run data
nodes and task trackers on the same machines. We have observed that our
region servers are vulnerable when we run MR jobs in our cluster. Is this a
sign of contention for resources (memory) on the region servers, or is
co-locating RS with data nodes/task trackers generally not advisable?
3) So far our data size is close to 3 TB (including replication), split
over 6 region servers. An example of region server stats:
numberOfOnlineRegions=231,
numberOfStores=2055, numberOfStorefiles=1180, storefileIndexSizeMB=15,
We would like to understand what an ideal load for a region server is,
both in terms of rows and the actual data that it can serve. We want to
know whether we are pushing the system beyond what it is expected to
handle.
4) Our region servers usually go down in one of two ways: either just after
a long GC pause (DURATION), because the pause prevents the RS from acking
to ZK to keep its session alive, or because of an OOM.
5) The cost. The large instances and the replication factor of 3 are
expensive. We are not sure if scaling horizontally with large instances is
the way to go.
It'd be very useful if we could understand some already known limitations
of the system, so that we won't end up spending time in the wrong direction.
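
To make points 2 and 3 concrete, here is a rough back-of-the-envelope sketch of the arithmetic behind our concern. The node memory, region server heap, data size and region/store counts come from the numbers above. The DataNode and TaskTracker heap sizes, the number of task slots and the per-task heap are assumed values chosen only for illustration (they are not our real settings), and the per-region figure assumes every server holds roughly the 231 regions of the one server sampled.

```python
# Back-of-the-envelope figures for one m1.large worker node.
# Values marked ASSUMED are illustrative guesses, not real settings.

NODE_RAM_GB = 7.5            # total memory of a large instance (from the post)
REGIONSERVER_HEAP_GB = 4.0   # region server allocation (from the post)

DATANODE_HEAP_GB = 1.0       # ASSUMED
TASKTRACKER_HEAP_GB = 1.0    # ASSUMED
TASK_SLOTS = 4               # ASSUMED map + reduce slots per node
CHILD_TASK_HEAP_GB = 0.5     # ASSUMED heap per running task


def memory_headroom_gb(task_slots=TASK_SLOTS):
    """RAM left for the OS and page cache when every task slot is busy."""
    committed = (REGIONSERVER_HEAP_GB + DATANODE_HEAP_GB
                 + TASKTRACKER_HEAP_GB + task_slots * CHILD_TASK_HEAP_GB)
    return NODE_RAM_GB - committed


def per_server_load(total_tb=3.0, replication=3, servers=6,
                    regions=231, stores=2055):
    """Unreplicated data per server and per region, plus stores per region."""
    unique_gb = total_tb * 1024 / replication   # strip the replication factor
    gb_per_server = unique_gb / servers
    return {
        "gb_per_server": gb_per_server,
        "gb_per_region": gb_per_server / regions,
        "stores_per_region": stores / regions,
    }


if __name__ == "__main__":
    print("headroom (GB):", memory_headroom_gb())
    for name, value in per_server_load().items():
        print(name, round(value, 2))
```

With these assumed heap sizes the node is over-committed by half a gigabyte once all four task slots are busy, which would be consistent with region servers dying during MR jobs. The real headroom depends entirely on the actual heap and slot settings.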
Though we have been able to fix a lot of issues when they happen, we are
looking for a more stable architecture. We feel that the cluster setup we
have is wrong and we need confirmation and suggestions from the community.
We would really appreciate your suggestions and pointers; any ideas would
be very helpful. We would also love to hear about an architecture of your
own that is not giving you nightmares.