-
Shared HDFS for HBase and MapReduce
Atif Khan 2012-06-06, 03:29
What is the "best practice" for deploying HBase, MapReduce, and HDFS together? We want to store our data in HBase and then run analytics on it using MapReduce. The MapReduce jobs will read from both HBase tables and HDFS files.

My first thought was to create a single HDFS cluster and point both the MapReduce and HBase servers at that common HDFS installation. However, Cloudera's Dos and Don'ts page ( http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/ ) insists that MapReduce and HBase should not share an HDFS cluster and should instead each have their own cluster. I don't understand this recommendation, as it would mean moving data from one HDFS cluster to another whenever we run MapReduce over HBase.

Any help/ideas would be appreciated. Thanks!

--
View this message in context: http://apache-hbase.679495.n3.nabble.com/Shared-HDFS-for-HBase-and-MapReduce-tp4018856.html
Sent from the HBase - Developer mailing list archive at Nabble.com.
-
Re: Shared HDFS for HBase and MapReduce
Stack 2012-06-06, 04:07
On Tue, Jun 5, 2012 at 8:29 PM, Atif Khan <[EMAIL PROTECTED]> wrote:
> My first thoughts were to create a single HDFS cluster, and then point the
> MapReduce and HBase servers to use the common HDFS installation. However,
> Cloudera's Dos and Don'ts page
> ( http://www.cloudera.com/blog/2011/04/hbase-dos-and-donts/) insists that
> MapReduce and HBase should not share an HDFS cluster. Rather they should
> have their own individual clusters. I don't understand this recommendation,
> as it would result in moving data around from one HDFS cluster to another
> when running MapReduce over HBase.

It starts out "Be careful when running mixed workloads on an HBase cluster." Does your use case fit the case described: "...SLAs on hbase access" while at the same time running heavy MapReduce jobs on the same cluster? If so, you may want to go with the suggested two clusters.

I'd suggest you start with everything on one cluster and see how you do. That post is more than a year old. HBase has gotten steadily better since.

St.Ack
-
RE: Shared HDFS for HBase and MapReduce
Vladimir Rodionov 2012-06-06, 04:23
You can share a cluster between HBase and MR if you run MR jobs only to process data off HBase and do not use HBase for real-time queries.

It is generally not advisable to run MR jobs on a live (real-time) HBase cluster, since HDFS can easily get saturated by the MR jobs, and you will then see much worse HBase query latency and overall query throughput.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [EMAIL PROTECTED]
-
RE: Shared HDFS for HBase and MapReduce
Mathias Herberts 2012-06-06, 07:19
We run M/R jobs that query HBase in a pool with a limited number of mapper slots. It works like a charm for serving both RT and batch queries on HBase.
-
RE: Shared HDFS for HBase and MapReduce
Vladimir Rodionov 2012-06-06, 17:49
Sure, limiting the number of slots is a way of throttling IO for MR jobs. If you can do this, go ahead.

Best regards,
Vladimir Rodionov
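A rough way to reason about the slot-limiting approach discussed above is to work out how much of the cluster a capped pool can occupy at once. The sketch below is only back-of-the-envelope arithmetic; the function names, the node and slot counts, and the per-task read rate are all hypothetical and would need to be replaced with numbers measured on a real cluster.

```python
def mr_slot_share(nodes, map_slots_per_node, pool_max_maps):
    """Return (concurrent_maps, fraction_of_all_map_slots) for a capped pool.

    All inputs are hypothetical planning numbers, not values read from Hadoop.
    """
    total_slots = nodes * map_slots_per_node
    # A pool can never run more tasks than the cluster has slots.
    concurrent = min(pool_max_maps, total_slots)
    return concurrent, concurrent / total_slots


def peak_mr_read_mbps(concurrent_maps, mbps_per_map):
    """Upper bound on aggregate read rate if every running map reads at full speed."""
    return concurrent_maps * mbps_per_map


# Example: 20 nodes with 8 map slots each, pool capped at 40 concurrent maps,
# each map assumed to read about 30 MB/s.
concurrent, share = mr_slot_share(nodes=20, map_slots_per_node=8, pool_max_maps=40)
print(concurrent, share, peak_mr_read_mbps(concurrent, 30))
```

Comparing the resulting peak read rate with what the disks and network can sustain gives a first estimate of how much headroom is left for real-time HBase traffic.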
-
Re: Shared HDFS for HBase and MapReduce
Stack 2012-06-06, 04:35
On Tue, Jun 5, 2012 at 9:07 PM, Stack <[EMAIL PROTECTED]> wrote:
> It starts out "Be careful when running mixed workloads on an HBase
> cluster." Does your use case fit the case described: "...SLAs on
> hbase access" and at the same time running heavy mapreduce jobs on
> same cluster? If so, you may want to do the suggested two clusters.
>
> I'd suggest you start w/ all on the one cluster and see how you do.
> That post is more than a year old. HBase has gotten steadily better since.

Please ignore my barebones response above. I see the question was asked earlier, and the responses there were much more substantial and of higher quality (or see Vladimir's on this thread).
St.Ack
-
Re: Shared HDFS for HBase and MapReduce
Atif Khan 2012-06-06, 18:15
Thanks to all who replied, especially Vladimir and Mathias!

So if I understand this correctly, there is a physical resource contention problem, given that both MR and HBase are resource hungry. Therefore, when end-user SLAs are in place, performance guarantees may be compromised when HBase and MR share the same HDFS cluster (and other resources).

Following Mathias's suggestion, on a production HDFS cluster we could throttle/limit the MR activity so that it has minimal impact on HBase's (realtime) performance.

So far so good.

Now my BIG question is about the BIG Data itself (no pun intended). If I do create two HDFS clusters (one for MR and one for HBase), with HBase acting as data source and sink, would I not be forced to move LARGE amounts of data between the two HDFS clusters? Given the size of the data, this could potentially congest the internal network on which the two independent HDFS clusters are deployed.

Thoughts?
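To put a number on the inter-cluster transfer concern, a simple estimate of copy time is enough to start with. This is a minimal sketch under stated assumptions: the data size, link speed, and the fraction of the link the copy is allowed to use are all hypothetical, and protocol overhead is ignored.

```python
def transfer_hours(data_tb, link_gbps, utilization=0.5):
    """Estimate hours needed to copy data_tb terabytes over a shared link.

    utilization is the assumed fraction of the link the copy may consume;
    decimal units are used (1 TB = 1e12 bytes, 1 Gbps = 1e9 bits/s).
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    bits = data_tb * 1e12 * 8
    usable_bits_per_second = link_gbps * 1e9 * utilization
    return bits / usable_bits_per_second / 3600


# Example: 10 TB over a 10 Gbps inter-cluster link, with half the link reserved
# for other traffic.
print(round(transfer_hours(10, 10, 0.5), 2))
```

If a job has to move its input and output this way on every run, the estimate applies to each run, which is why the data-movement cost matters when choosing between one cluster and two.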
-
Re: Shared HDFS for HBase and MapReduce
Stack 2012-06-06, 19:04
On Wed, Jun 6, 2012 at 11:15 AM, Atif Khan <[EMAIL PROTECTED]> wrote:
> Now my BIG question is about the BIG Data itself (no pun intended). If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters? Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.

Yes.

St.Ack
-
Re: Shared HDFS for HBase and MapReduce
Atif Khan 2012-06-06, 19:23
This is beginning to sound like a catch-22 problem. I think I would personally lean towards a single high-performing HDFS cluster that can be shared between the various types of applications (realtime vs. analytics), and then control/balance the resource requirements of each application. This would work for scenarios where I can predict the different types of applications/workloads beforehand. However, if for some reason the nature of the workload shifts, that could potentially throw off the whole resource equilibrium.

Are there any additional Hadoop-specific monitoring tools that can be deployed to predict resource/performance bottlenecks in advance (in addition to the regular BMC-type tools)?
-
Re: Shared HDFS for HBase and MapReduce
Amandeep Khurana 2012-06-06, 19:54
If your workload is only batch processing (MR), you don't need to separate the clusters in the first place, so you don't have the problem of moving large amounts of data between clusters.

Having a common HDFS cluster and using part of the nodes as HBase RS and part as Hadoop TTs doesn't solve the problem of moving data from the HBase RS to the tasks you'll run as part of your MR jobs if HBase is your source/sink. You will still be reading/writing over the network.

On the other hand, if your workload is 'realtime' random reads/writes, the amount of data you are likely to be accessing is small and therefore not expensive to move. Moreover, it is going to be accessed from a client application of some sort that is not an MR job.
-
Re: Shared HDFS for HBase and MapReduce
Atif Khan 2012-06-06, 20:27
Thanks Amandeep!

What I was saying is that we are trying to support both types of workloads: realtime transactional workloads, and batch processing for data analysis. The big question is whether a single HDFS cluster should be shared between the two workloads.

The point that you are trying to make (if I am understanding you correctly) is about data "locality":

Amandeep Khurana wrote:
> Having a common HDFS cluster and using part of the nodes as HBase RS and
> part as the Hadoop TTs doesn't solve the problem of moving data from the
> HBase RS to the tasks you'll run as a part of your MR jobs if HBase is your
> source/sink. You will still be reading/writing over the network.

When running MR jobs over HBase, data locality is provided by HBase (please see http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html, and also "HBase: The Definitive Guide" by Lars George, page 298, "MapReduce Locality"). In other words, the computation is shipped to where the data is, which limits the need to transfer data over the network. Proper data locality has a big impact on overall performance.

So I believe that a common HDFS cluster does not imply logical segregation between HBase RS and Hadoop TTs. Therefore, your point seems to contradict Lars George's statement.

Thoughts?
-
Re: Shared HDFS for HBase and MapReduce
Amandeep Khurana 2012-06-06, 21:05
When you run an MR job with HBase as a source/sink, you use the HBase API under the hood (get, put, scan). That API is how your client (in this case the map or reduce tasks) interacts with the region servers. Data locality in an MR job is achieved by having the tasks run on the same physical nodes as the region servers, so that communication over the network is minimal.

Data locality for the region servers is a different conversation. That is about the region server process talking to the local datanode for its underlying HFiles rather than to remote ones. It has nothing to do with the MR jobs talking to HBase.
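The task-to-region-server locality described above can be illustrated with a small model. The sketch assumes one input split per region, each tagged with the host of the region server currently serving it, and simply checks which splits could run on a node that also hosts a TaskTracker. The host names, the split layout, and the locality_report function are hypothetical; this is not how the Hadoop scheduler is implemented.

```python
def locality_report(splits, tasktracker_hosts):
    """Partition splits into those that can run node-local and those that cannot.

    splits: list of (region_name, region_server_host) pairs.
    tasktracker_hosts: hosts that run a TaskTracker.
    """
    tt_hosts = set(tasktracker_hosts)
    local = [name for name, host in splits if host in tt_hosts]
    remote = [name for name, host in splits if host not in tt_hosts]
    return local, remote


# Co-located layout: every region server host also runs a TaskTracker.
splits = [("region-a", "node1"), ("region-b", "node2"), ("region-c", "node3")]
print(locality_report(splits, ["node1", "node2", "node3"]))

# Segregated layout: TaskTrackers run on different hosts, so every scan
# issued by a map task crosses the network.
print(locality_report(splits, ["node4", "node5"]))
```

The second call shows the situation raised earlier in the thread: dedicating some nodes to region servers and others to TaskTrackers leaves no split with a local task slot, even though HDFS itself is shared.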
-
Re: Shared HDFS for HBase and MapReduce
Doug Meil 2012-06-06, 21:14
Regarding locality, it's not just Lars' stuff, it's in the RefGuide (see section 9.7.3)...
http://hbase.apache.org/book.html#regions.arch

re: "You will still be reading/writing over the network"

This is definitely true as far as writes go because of the replicas (see the RefGuide for why), although I disagree on the read portion unless there is an exceptional case (which is typically the result of an RS going down).
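The point about writes always touching the network follows from HDFS replication, and the volume involved is easy to estimate. The sketch below assumes the common HDFS behaviour in which the first replica is written to the local datanode and the remaining replicas go to other nodes; the function name and the example figures are hypothetical.

```python
def remote_write_bytes(bytes_written, replication=3, first_replica_local=True):
    """Bytes sent to other nodes when bytes_written are stored with replication.

    Assumes the first replica lands on the writer's own datanode when
    first_replica_local is True; otherwise every replica is remote.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    remote_copies = replication - 1 if first_replica_local else replication
    return bytes_written * remote_copies


# Example: flushing 100 MB with the usual replication factor of 3 sends
# roughly 200 MB to other datanodes.
print(remote_write_bytes(100 * 1024 * 1024) // (1024 * 1024))
```

Reads do not pay this cost when the region server finds its HFile blocks on the local datanode, which matches the distinction drawn above between the write and read paths.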
+
Atif Khan 2012-06-06, 21:23
-
Re: Shared HDFS for HBase and MapReduce
Joey Echeverria 2012-06-06, 19:50
> Now my BIG question is about the BIG Data itself (no pun intended). If I do
> create two HDFS clusters (one for MR and one for HBase), and then given that
> HBase acting as data source and sink; Would I not be forced to move LARGE
> amounts of data between the two HDFS clusters? Given the size of the data,
> this could potentially congest the internal network on which the two
> independent HDFS clusters are deployed.

That's definitely true if HBase is the source and sink. Many organizations that need to do both real-time serving and batch processing do something more akin to the following:

1) Split ingest of new data to feed both HBase and an HDFS/MR-only cluster.
2) Do batch processing on the HDFS/MR cluster.
3) Push the results into HBase, through either the put API or the bulk-load API, with any updates/new tables the batch processes create.

This means that you only have to push the results to HBase, and you can view that as just another ingest source. That way, it's built into the equation when you figure out how to size your HBase cluster.

Also, if you do run MR directly over your HBase cluster (or on a shared HDFS), you must make sure to build that load into any sizing calculations, and that you can either mask the latency spikes that might occur or accept them under your SLA.

-Joey

--
Joey Echeverria
Principal Solutions Architect
Cloudera, Inc.
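The sizing advice above, treating pushed batch results as just another ingest source, can be sketched as simple capacity arithmetic. Every number here is a hypothetical planning input (writes per second, per-server capacity, headroom), and region_servers_needed is an illustrative helper, not part of any HBase tool.

```python
import math


def region_servers_needed(live_ingest_wps, batch_push_wps, per_server_wps, headroom=0.25):
    """Estimate region servers needed for live ingest plus batch result pushes.

    headroom is the fraction of each server's write capacity kept free so that
    latency spikes can be absorbed.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in [0, 1)")
    total_wps = live_ingest_wps + batch_push_wps
    usable_wps = per_server_wps * (1 - headroom)
    return math.ceil(total_wps / usable_wps)


# Example: 20,000 writes/s of live ingest, 10,000 writes/s while batch results
# are being pushed, and servers assumed to sustain 5,000 writes/s each.
print(region_servers_needed(20000, 10000, 5000))
```

Leaving the batch push out of the calculation is the failure mode the message warns about: the cluster looks adequately sized until the first large result load arrives.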