HBase: minimal number of boxes?
Otis Gospodnetic 2010-01-14, 05:01
Hello,
I was wondering what a minimal setup in terms of # of servers might be for HBase. Here is what I think is needed:
1 or 2 HBase master servers -- 1 or 2 dedicated boxes?
1 or more RegionServers -- 1 or more dedicated boxes?
1 or more Zookeepers -- 1 or more dedicated boxes?
If running on HDFS, add:
1 or 2 NameNodes -- can this run on same box(es) as HBase master?
1 or more DataNodes -- can DNs be on same box(es) as RegionServers?
If you want to run MR jobs on data in HBase, add:
1 or more JobTrackers -- can this run on the same box as HBase master and NN?
1 or more TaskTrackers -- can this run on the same box as RegionServer + DN?
So, my main questions are:
* Is it OK for HBase Master and NameNode (+JobTracker) to run on the same server? NN needs memory. What does HBase Master need the most?
* Is it OK for RegionServer and DataNode (+TaskTracker) to run on the same server? (I think this is actually advised, so data is local?) I believe RegionMaster is a memory hungry (b/c of Memcache) process? I believe DNs need the CPU to run the MR jobs, and disk I/O, of course.
* Finally, is the following correct?
Non-HA system, with local disk:
1 HB master/NN/JT + 1 RegionServer/TT/DN + 1 ZK = 3 boxes
HA HBase cluster with HDFS:
2 HB masters/NNs/JTs + 2 RegionServers/TTs/DNs + 2 ZKs = 6 boxes
Thanks, Otis
-
Re: HBase: minimal number of boxes?
Andrew Purtell 2010-01-15, 19:51
Hi Otis,
> * Is it OK for HBase Master and NameNode (+JobTracker) to run on the same server? NN needs memory. What does HBase Master need the most?
The HBase Master is normally not very busy. It just needs to be available when region servers check in, and to keep its Zookeeper heartbeats timely. As long as there is sufficient RAM on the combined NameNode+Master (+JobTracker) box such that the system never swaps, this is OK.
You can consider running multiple HBase masters to remove one SPOF from the deployment, but the Hadoop side still has single points of failure of its own: the NameNode and the JobTracker. But, yes, for a non-HA deployment it makes sense to load all of these up on one server.
> * Is it OK for RegionServer and DataNode (+TaskTracker) to run on the same server? (I think this is actually advised, so data is local?)
Yes, this is advised for that reason. Eventually, through background compaction, the data in HDFS which backs the region stores is brought local. MapReduce jobs run against HBase after this happens get data locality, because each split corresponds to a region and its task will be scheduled on the corresponding region server.
> I believe RegionMaster is a memory hungry (b/c of Memcache) process?
Yes. The more RAM you can give to the region servers, the better for performance:
- Read caching (block cache) to avoid needing to hit the filesystem to serve frequently accessed data
- Write caching (MemStore) to ride over flushes and compactions without blocking clients
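To make the trade-off between the two caches concrete, here is a minimal sketch of how a region server heap might be divided. It assumes both caches are sized as fractions of the heap; the function name `split_heap` and the fractions used below are hypothetical illustrations, not recommendations from this thread or HBase defaults.

```python
def split_heap(heap_mb: int, block_cache_fraction: float, memstore_fraction: float) -> dict:
    """Divide a region server heap between read cache, write cache, and the rest.

    All names and fractions here are illustrative assumptions.
    """
    for name, value in (("block_cache_fraction", block_cache_fraction),
                        ("memstore_fraction", memstore_fraction)):
        if not 0.0 <= value < 1.0:
            raise ValueError(f"{name} must be in [0, 1), got {value}")
    if block_cache_fraction + memstore_fraction >= 1.0:
        raise ValueError("the two caches cannot consume the whole heap")

    block_cache_mb = int(heap_mb * block_cache_fraction)  # read caching
    memstore_mb = int(heap_mb * memstore_fraction)        # write caching
    return {
        "block_cache_mb": block_cache_mb,
        "memstore_mb": memstore_mb,
        # Whatever is left serves requests, compactions, and everything else.
        "other_mb": heap_mb - block_cache_mb - memstore_mb,
    }


if __name__ == "__main__":
    # Same (assumed) fractions applied to progressively larger heaps.
    for heap_mb in (1024, 4096, 8192):
        print(heap_mb, split_heap(heap_mb, 0.25, 0.5))
```

With the fractions held fixed, every extra GB of heap grows both caches, which is one way to read "the more RAM, the better".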
> 1 or more Zookeepers -- 1 or more dedicated boxes?
I would advise running a dedicated ZK quorum ensemble, yes. ZK is a 2N+1 fault tolerant system: an ensemble of 2N+1 servers keeps working as long as a majority is up, so it tolerates N failures. Deploy 3 servers if you can stand to lose only one, or 5 if you want to be able to lose up to 2, etc. IIRC, there are diminishing returns after 7 or 9. Though this may seem like a lot of overhead just to run HBase, ZK has a lot of merit on its own terms for providing synchronization primitives for your service or application, hosting dynamic config (with watchers to get notice of changes), presence and group membership, etc.
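The 2N+1 rule can be sketched in a few lines. This is only arithmetic on the majority requirement, not ZooKeeper code; the function names are made up for illustration.

```python
def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can fail while a strict majority stays up."""
    if ensemble_size < 1:
        raise ValueError("an ensemble needs at least one server")
    return (ensemble_size - 1) // 2


def servers_needed(failures: int) -> int:
    """Smallest ensemble that survives the given number of failures (2N+1)."""
    if failures < 0:
        raise ValueError("failures must be non-negative")
    return 2 * failures + 1


if __name__ == "__main__":
    for size in (1, 2, 3, 4, 5, 7):
        print(f"{size} servers -> survives {tolerated_failures(size)} failure(s)")
```

Note that an ensemble of 2 tolerates no failures at all, the same as a single server, and 4 tolerates no more than 3 does. That is why ensembles are sized in odd numbers.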
> Non-HA system, with local disk:
> 1 HB master/NN/JT + 1 RegionServer/TT/DN + 1 ZK = 3 boxes
Too small. In my experience you need at least 3 RegionServer/TT/DN boxes for something minimally useful. Also remember to tune HDFS for such a small cluster -- set minimum replication to 1 or 2.
> HA HBase cluster with HDFS:
> 2 HB masters/NNs/JTs + 2 RegionServers/TTs/DNs + 2 ZKs = 6 boxes
Too small, likewise.
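As a rough way to check a proposed layout against the points above, here is a small sketch. It is one reading of this reply, not a sizing tool: the `Layout` class and its field names are hypothetical, the floor of 3 worker boxes comes from the advice above, and the even-ensemble check follows from ZK needing a majority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    master_boxes: int  # each runs HBase Master + NameNode + JobTracker
    worker_boxes: int  # each runs RegionServer + DataNode + TaskTracker
    zk_boxes: int      # dedicated ZooKeeper servers

    def total(self) -> int:
        return self.master_boxes + self.worker_boxes + self.zk_boxes

    def problems(self, min_workers: int = 3) -> list:
        """Return a list of concerns; an empty list means none were found."""
        issues = []
        if self.worker_boxes < min_workers:
            issues.append(f"fewer than {min_workers} RegionServer/TT/DN boxes")
        if self.zk_boxes < 1:
            issues.append("no ZooKeeper server")
        elif self.zk_boxes % 2 == 0:
            # A majority quorum gains nothing from the extra even member.
            issues.append("even-sized ZK ensemble adds no fault tolerance")
        return issues


if __name__ == "__main__":
    for name, layout in (
        ("proposed non-HA", Layout(1, 1, 1)),
        ("proposed HA", Layout(2, 2, 2)),
        ("1 master, 3 workers, 3 ZK", Layout(1, 3, 3)),
    ):
        print(name, layout.total(), layout.problems())
```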
Hope this helps,
- Andy