HBase >> mail # user >> Thoughts on a hybrid HBase-Hadoop cluster


Re: Thoughts on a hybrid HBase-Hadoop cluster
Otis,

You could co-locate RSs with TTs and DNs for the most part, as long as you
are not really serving "real time" requests. Just tweak your task configs
and give HBase enough RAM. You get the benefit of data locality, and that
could improve performance. But you should definitely try out your approach
alongside full co-location and compare the two. If you are going to be
scanning tables and doing a bunch of writes back, I suspect there will be a
fair bit of network traffic with your approach, and that might be your
bottleneck. If you co-locate, you can potentially get better performance.
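
To make "tweak your task configs and give HBase enough RAM" a bit more concrete, here is a rough back-of-the-envelope sketch of how many MR task slots fit on a node once the daemons have their memory. Everything in it is an assumption for illustration: the function name, the default heap sizes, and the example node sizes are hypothetical, not measured values or recommendations.

```python
def task_slots(total_ram_gb, rs_heap_gb, dn_heap_gb=1.0, tt_heap_gb=1.0,
               os_reserve_gb=1.0, task_heap_gb=1.0):
    """Estimate how many MR task slots fit on one node.

    All defaults are illustrative assumptions, not tuned values.
    Pass rs_heap_gb=0 for a node that runs no RegionServer.
    """
    # RAM left after the RegionServer, DataNode, TaskTracker and the OS.
    leftover = total_ram_gb - rs_heap_gb - dn_heap_gb - tt_heap_gb - os_reserve_gb
    if leftover <= 0:
        return 0
    # Each map or reduce task runs in its own child JVM with task_heap_gb of heap.
    return int(leftover // task_heap_gb)


# Hypothetical "large" node: 16 GB RAM, 8 GB of it for the RegionServer heap.
large_node_slots = task_slots(total_ram_gb=16, rs_heap_gb=8)

# Hypothetical "small" node: 4 GB RAM, no RegionServer.
small_node_slots = task_slots(total_ram_gb=4, rs_heap_gb=0)

print(large_node_slots, small_node_slots)
```

The resulting number is what you would split between the TaskTracker's map and reduce slot limits (mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum), with the RS heap set through HBASE_HEAPSIZE in hbase-env.sh. Treat it as a starting point for testing, not a sizing rule.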

Are you planning on doing this on EC2 or did you mention the small and
large instances as an example?

-Amandeep

On Tue, Dec 13, 2011 at 11:44 AM, Otis Gospodnetic <
[EMAIL PROTECTED]> wrote:

> Hi,
>
> I was wondering if I could get some feedback on the craziness (or not) of
> setting up a hybrid HBase-Hadoop cluster that has the following primary
> uses:
>
> 1) continuous writes to HBase
> 2) disk and CPU intensive reads from HBase by MR jobs and writes of
> aggregated data back to HBase by those jobs
> 3) occasional reads by people/reporting apps that read aggregates from
> HBase
>
> I'm calling this hybrid HBase-Hadoop cluster because not all nodes in the
> cluster would be running both a RegionServer and DataNode + TaskTracker.
> Instead, this is what it could look like:
>
> * a set of *larger* nodes running RegionServers, DataNodes, TaskTrackers
> (e.g., large EC2 instances)
> * a set of *smaller* nodes running only DNs and TTs, but *not* RSs (e.g.
> small EC2 instances)
>
>
> The thinking here is that because 2) above needs to process a lot of
> data (lots of reads, good amount of writes, and relatively CPU intensive)
> it's nice to have more nodes and spindles.
> But if we put RSs on all nodes to put it close to DNs, then all nodes need
> to be relatively beefy in terms of RAM to keep HBase happy, and that
> translates to more $$$.
> So the thinking/hope is that one could save $ by having more
> smaller/cheaper nodes to do the disk IO and CPU intensive work, while
> having just enough RS instances on the big nodes to handle the HBase side
> of 1) 2) and 3) above.
>
>
> Is the above setup crazy?
>
> Are there some obvious flaws that would really cause operational or
> performance pains?
> Would such a cluster have major performance issues because of data that
> needs to be transferred between DNs that are on all nodes and RSs running
> only on the big nodes?
>
>
> Thanks,
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/