

Re: HBase with EMR
Correct - you can access any external service by using a custom jar.
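
A custom jar that talks to an external HBase cluster has to carry that cluster's ZooKeeper connection settings itself. The sketch below only collects those settings into a map, so it needs nothing beyond the JDK. `hbase.zookeeper.quorum` and `hbase.zookeeper.property.clientPort` are standard HBase client property names; the class name, the method name, and the host names are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: collects client-side settings for a remote HBase cluster. */
public class RemoteHBaseSettings {

    /**
     * Builds the properties an HBase client uses to locate a cluster via ZooKeeper.
     * The property keys are standard HBase names; everything else is an assumption.
     */
    public static Map<String, String> forCluster(List<String> zkHosts, int zkClientPort) {
        if (zkHosts == null || zkHosts.isEmpty()) {
            throw new IllegalArgumentException("at least one ZooKeeper host is required");
        }
        if (zkClientPort < 1 || zkClientPort > 65535) {
            throw new IllegalArgumentException("invalid port: " + zkClientPort);
        }
        Map<String, String> props = new LinkedHashMap<>();
        // Comma-separated list of the ZooKeeper hosts of the external HBase cluster.
        props.put("hbase.zookeeper.quorum", String.join(",", zkHosts));
        // Port on which those ZooKeeper servers accept client connections.
        props.put("hbase.zookeeper.property.clientPort", Integer.toString(zkClientPort));
        return props;
    }

    public static void main(String[] args) {
        // Placeholder host names; substitute the hosts of your own cluster.
        Map<String, String> props = forCluster(
                Arrays.asList("zk1.example.internal", "zk2.example.internal"), 2181);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job driver, each entry would presumably be copied onto the Hadoop `Configuration` with `conf.set(key, value)` before the job is set up with TableMapReduceUtil. The EMR nodes also need network access to the ZooKeeper and region server hosts.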

On Sun, Mar 4, 2012 at 10:55 PM, Mohit Gupta
<[EMAIL PROTECTED]> wrote:

> Hi All,
>
> Thank you so much. It has been a great help.
> As of now, I am exploring the idea of running an HBase cluster on EC2
> (EBS-backed) and using EMR to run the heavy ad-hoc jobs.
>
> I got confused by reading in a couple of places (esp. this Amazon EMR
> forum thread
> https://forums.aws.amazon.com/thread.jspa?messageID=238747 and the
> EMR doc, where it is mentioned in a number of places that 'The service runs
> job flows in Amazon EC2 and stores input and output data in Amazon S3
> and/or Amazon DynamoDB.') that HBase can't be used with EMR. But now,
> after going through your replies, I understand it this way: for using Hive
> on EMR, input and output need to be on S3 (or now DynamoDB as well). And
> for using other input/output sources (like an EC2 HBase cluster), I need to
> write a custom jar for every single job/query.
>
> Please let me know if I have got this right or still missing something.
>
> Also, interesting idea of running a transient HBase beside the normal
> cluster.
>
>
>
> On Sun, Mar 4, 2012 at 2:50 AM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
>
> > Mohit,
> >
> > Adding to what Andy and Vaibhav have listed - you'll need to ensure that
> > the Hadoop versions running in EMR and your HBase cluster are compatible if
> > you want to run MapReduce from EMR onto an external HBase cluster.
> >
> > If you choose to run HBase on your EMR cluster and don't want it to tear
> > down on job completion, start the cluster with the alive flag. However, the
> > moment the health of your master node goes bad (this does not happen very
> > often, but it is not unheard of either, and it is more common on EC2 than on
> > physical hardware), the EMR cluster will terminate. Read up on the semantics
> > of the alive flag and termination protection to understand the behavior
> > better.
> >
> > Another thing to be aware of while running HBase on EMR: you will most
> > likely be limited to keeping your HBase master and ZK on the node running
> > your NameNode and JobTracker (aka the EMR master). You can run multiple
> > masters and ZK nodes, or have separate nodes outside of the existing EMR
> > cluster, but you will need to do extra work (like adding nodes to the same
> > security groups and spinning up instances separately after the EMR cluster
> > is up).
> >
> > It comes down to specifying your requirements clearly and then figuring out
> > the right solution. :) You'll get plenty of help on the mailing list.
> >
> > Hope this helps.
> >
> > -Amandeep
> >
> > On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Mohit,
> > >
> > > I have written the blogpost.
> > >
> > > EMR is nothing but MapReduce. HBase provides TableInputFormat. With the
> > > TableInputFormat and TableMapReduceUtil (
> > > http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html
> > > ) classes, you can specify your source as HBase, hosted anywhere as long as
> > > it's accessible over the internet. In doing so, if HBase is not hosted on
> > > the same Hadoop cluster (which it won't be in the case of an EMR job), you
> > > will be sacrificing data locality (we are okay with that).
> > >
> > > Regards,
> > > Vaibhav
> > >
> > > On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> > >
> > > > I think there are a couple of things conflated here. Let me make four
> > > > brief points and then feel free to follow up where you would like more
> > > > information.
> > > >
> > > > 1) Many run HBase (and self-hosted Hadoop) on EC2. These clusters have
> > > > their own HDFS on EBS or instance store volumes.
> > > >
> > > > 2) You cannot run HBase backed by S3. Search the HBase user list for other
> > > > emails on the subject. But this of course does not mean you cannot run
> > > > HBase on EC2. (See point 1.)