Re: Shared HDFS for HBase and MapReduce
> Now my BIG question is about the BIG Data itself (no pun intended).  If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters?  Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.
That's definitely true if HBase is the source and sink. Many
organizations that need to do both real-time serving and batch
processing do something more akin to the following:

1) Split ingest of new data to feed both HBase and an HDFS/MR-only cluster.
2) Do batch processing on the HDFS/MR cluster.
3) Push the results, including any updates or new tables the batch
processes create, into HBase through either the put API or the bulk
load API.

This means that you only have to push the results to HBase and you can
view that as just another ingest source. That way, it's built into the
equation when you figure out how to size your HBase cluster.
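
To make the three steps above concrete, here is a minimal sketch of the data flow. It does not use the HBase or Hadoop APIs: ServingStore and BatchStore are hypothetical in-memory stand-ins for the HBase cluster and the HDFS/MR-only cluster, and the event data is made up for illustration.

```python
from collections import defaultdict


class ServingStore:
    """Hypothetical in-memory stand-in for an HBase table (not the HBase API)."""

    def __init__(self):
        self.rows = {}
        self.writes = 0

    def put(self, key, value):
        self.rows[key] = value
        self.writes += 1


class BatchStore:
    """Hypothetical in-memory stand-in for data on the HDFS/MR-only cluster."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


def ingest(events, serving, batch):
    # Step 1: split ingest so every new event feeds both clusters.
    for user, amount in events:
        serving.put(f"event:{user}:{serving.writes}", amount)
        batch.append((user, amount))


def run_batch(batch):
    # Step 2: batch processing reads only from the batch side.
    totals = defaultdict(int)
    for user, amount in batch.records:
        totals[user] += amount
    return dict(totals)


def push_results(totals, serving):
    # Step 3: only the results are written back to the serving side.
    for user, total in totals.items():
        serving.put(f"total:{user}", total)


events = [("alice", 3), ("bob", 5), ("alice", 4)]  # made-up sample data
serving, batch = ServingStore(), BatchStore()
ingest(events, serving, batch)
totals = run_batch(batch)
push_results(totals, serving)
```

The point of the sketch is that the batch job never reads from the serving store. The only extra writes the serving side sees from batch work are the results in step 3.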

Also, if you do run MR directly over your HBase cluster (or on a
shared HDFS), you must build that load into your sizing calculations
and make sure that you can either mask the latency spikes that might
occur or accept them under your SLA.
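
As a rough illustration of building that load into a sizing calculation, the sketch below adds the batch write rate to the live ingest rate and reserves some headroom. The function name, the simple additive model, and the numbers are all assumptions for illustration, not a recommended formula.

```python
def required_write_capacity(live_ingest_rps, batch_push_rps, headroom=0.5):
    """Writes/sec the serving cluster should sustain.

    Assumes a simple additive model: live ingest plus batch pushes,
    with `headroom` (a fraction of capacity) kept free to absorb spikes.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    return (live_ingest_rps + batch_push_rps) / (1 - headroom)


# Hypothetical rates: 7000 live writes/sec plus 3000 writes/sec of batch results.
capacity = required_write_capacity(7000, 3000, headroom=0.5)
```

With these made-up inputs, the combined 10000 writes/sec and 50% headroom give a target of 20000 writes/sec.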

-Joey

On Wed, Jun 6, 2012 at 2:15 PM, Atif Khan <[EMAIL PROTECTED]> wrote:
> Thanks to all who replied, especially Vladimir and Mathias!!!
>
> So if I understand this correctly, there is physical resource contention
> problem given that both MR and HBase are resource hungry.  Therefore, when
> end-user SLAs are in place, performance guarantees may be compromised when
> HBase and MR share the same HDFS cluster (and other resources).
>
> According to Mathias's suggestion, on production HDFS cluster, we could
> throttle/limit the MR activity so that it has minimal impact on HBase's
> (realtime) performance.
>
> So far so good.
>
> Now my BIG question is about the BIG Data itself (no pun intended).  If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters?  Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.
>
> Thoughts?
>
> --
> View this message in context: http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-tp4018856p4018878.html
> Sent from the HBase - Developer mailing list archive at Nabble.com.

--
Joey Echeverria
Principal Solutions Architect
Cloudera, Inc.