I was wondering if I could get some feedback on the craziness (or not) of setting up a hybrid HBase-Hadoop cluster that has the following primary uses:
1) continuous writes to HBase
2) disk and CPU intensive reads from HBase by MR jobs and writes of aggregated data back to HBase by those jobs
3) occasional reads by people/reporting apps that read aggregates from HBase
I'm calling this hybrid HBase-Hadoop cluster because not all nodes in the cluster would be running both a RegionServer and DataNode + TaskTracker.
Instead, this is what it could look like:
* a set of *larger* nodes running RegionServers, DataNodes, TaskTrackers (e.g., large EC2 instances)
* a set of *smaller* nodes running only DNs and TTs, but *not* RSs (e.g. small EC2 instances)
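Concretely, with stock Hadoop and HBase configs, that placement would just be a matter of which hosts go into which file — something like this (hostnames below are made up for illustration):

```
# Hadoop's conf/slaves -- DataNodes + TaskTrackers on ALL nodes
big-node-01
big-node-02
small-node-01
small-node-02
small-node-03

# HBase's conf/regionservers -- RegionServers on the big nodes ONLY
big-node-01
big-node-02
```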
The thinking here is that because 2) above needs to process a lot of data (lots of reads, a good amount of writes, and relatively CPU-intensive work), it's nice to have more nodes and spindles.
But if we put RSs on all nodes to keep them close to DNs, then every node needs to be relatively beefy in terms of RAM to keep HBase happy, and that translates to more $$$.
So the thinking/hope is that one could save $ by having more smaller/cheaper nodes do the disk IO and CPU-intensive work, while having just enough RS instances on the big nodes to handle the HBase side of 1), 2), and 3) above.
Is the above setup crazy?
Are there some obvious flaws that would really cause operational or performance pains?
Would such a cluster have major performance issues because of data that needs to be transferred between DNs that are on all nodes and RSs running only on the big nodes?
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/