It depends. There are some reasons to separate the two, but in general you don't need to.
The course is wrong to suggest this as a best practice.
On Jun 5, 2012, at 5:00 PM, "Atif Khan" <[EMAIL PROTECTED]> wrote:
> During a recent Cloudera course we were told that it is "Best practice" to
> isolate a MapReduce/HDFS cluster from an HBase/HDFS cluster as the two when
> sharing the same HDFS cluster could lead to performance problems. I am not
> sure if this is entirely true given the fact that the main concept behind
> Hadoop is to export computation to the data and not import data to the
> computation. If I were to segregate HBase and MapReduce clusters, then when
> using MapReduce on HBase data would I not have to transfer large amounts of
> data from HBase/HDFS cluster to MapReduce/HDFS cluster?
> Cloudera, on their best practice page
> (http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/), has the
> following:
> "Be careful when running mixed workloads on an HBase cluster. When you have
> SLAs on HBase access independent of any MapReduce jobs (for example, a
> transformation in Pig and serving data from HBase) run them on separate
> clusters. HBase is CPU and Memory intensive with sporadic large sequential
> I/O access while MapReduce jobs are primarily I/O bound with fixed memory
> and sporadic CPU. Combined these can lead to unpredictable latencies for
> HBase and CPU contention between the two. A shared cluster also requires
> fewer task slots per node to accommodate for HBase CPU requirements
> (generally half the slots on each node that you would allocate without
> HBase). Also keep an eye on memory swap. If HBase starts to swap there is a
> good chance it will miss a heartbeat and get dropped from the cluster. On a
> busy cluster this may overload another region, causing it to swap and a
> cascade of failures."
> All my initial investigation/reading led me to believe that I should create
> a common HDFS cluster and then I can run MapReduce and HBase against the
> common HDFS cluster. But from the above Cloudera best practice it seems
> like I should create two HDFS clusters, one for MapReduce and one for HBase
> and then move data around when required. Something does not make sense with
> this best practice recommendation.
> Any thoughts and/or feedback will be much appreciated.
> View this message in context: http://old.nabble.com/Shared-Cluster-between-HBase-and-MapReduce-tp33967219p33967219.html
> Sent from the HBase User mailing list archive at Nabble.com.
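
For anyone sizing a shared cluster, the two rules of thumb in the quoted Cloudera passage (roughly halve the task slots on nodes that also run HBase, and keep the node out of swap) can be sketched as simple arithmetic. This is only an illustration: the function name, the parameters, and the default daemon and OS reserves are assumptions made up for the example, not values from Cloudera or from this thread, and JVM heap sizes understate real process memory.

```python
from dataclasses import dataclass


@dataclass
class NodePlan:
    map_slots: int
    reduce_slots: int
    committed_mb: int   # memory promised to daemons and task JVMs
    fits_in_ram: bool   # False means the node is at risk of swapping


def plan_node(baseline_map_slots, baseline_reduce_slots, ram_mb,
              task_heap_mb, regionserver_heap_mb=0,
              other_daemons_mb=2048, os_reserve_mb=2048):
    """Hypothetical per-node sizing helper.

    baseline_*_slots: slots you would configure without HBase on the node.
    regionserver_heap_mb: 0 means no RegionServer runs on this node.
    other_daemons_mb, os_reserve_mb: assumed reserves (DataNode,
    TaskTracker, OS); adjust to your own measurements.
    """
    shares_with_hbase = regionserver_heap_mb > 0
    # Rule of thumb from the quoted passage: about half the slots
    # when HBase shares the node.
    divisor = 2 if shares_with_hbase else 1
    map_slots = max(1, baseline_map_slots // divisor)
    reduce_slots = max(1, baseline_reduce_slots // divisor)

    # Worst case: every slot is busy while the RegionServer uses its full heap.
    committed = ((map_slots + reduce_slots) * task_heap_mb
                 + regionserver_heap_mb + other_daemons_mb + os_reserve_mb)
    return NodePlan(map_slots, reduce_slots, committed, committed <= ram_mb)


if __name__ == "__main__":
    # Example node: 24 GB RAM, 1 GB task heaps, 8 GB RegionServer heap.
    print(plan_node(8, 4, ram_mb=24576, task_heap_mb=1024,
                    regionserver_heap_mb=8192))
```

On an MRv1 cluster the resulting slot counts would go into mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum on each TaskTracker. If fits_in_ram comes back False, reduce the slot count or the heap sizes before letting the node reach the point where it swaps.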