HBase >> mail # user >> HBase with EMR


Re: HBase with EMR
Correct - you can access any external service by using a custom jar.
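As a rough illustration of what such a custom jar has to know about an external HBase cluster, here is a minimal sketch in plain Java. The class `ExternalHBaseSettings`, its method `forCluster`, and the host and table names are hypothetical and exist only for this example. The three property keys are the standard HBase client and TableInputFormat configuration keys. The sketch only collects the settings; it assumes the job itself would copy them into its Hadoop Configuration and then set up the mapper with TableMapReduceUtil.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not part of HBase or EMR): gathers the client-side
// settings a custom jar would hand to its Hadoop job configuration so that
// the job can locate an HBase cluster running outside the EMR cluster.
final class ExternalHBaseSettings {

    private ExternalHBaseSettings() {}

    // zkQuorum: comma-separated ZooKeeper hosts of the external HBase cluster.
    // zkClientPort: ZooKeeper client port (2181 is the usual default).
    // inputTable: name of the HBase table the job reads from.
    static Map<String, String> forCluster(String zkQuorum, int zkClientPort, String inputTable) {
        if (zkQuorum == null || zkQuorum.trim().isEmpty()) {
            throw new IllegalArgumentException("ZooKeeper quorum must not be empty");
        }
        if (zkClientPort < 1 || zkClientPort > 65535) {
            throw new IllegalArgumentException("Invalid ZooKeeper port: " + zkClientPort);
        }
        if (inputTable == null || inputTable.trim().isEmpty()) {
            throw new IllegalArgumentException("Input table must not be empty");
        }
        Map<String, String> settings = new LinkedHashMap<>();
        // Where the HBase client finds ZooKeeper, and through it the region servers.
        settings.put("hbase.zookeeper.quorum", zkQuorum.trim());
        settings.put("hbase.zookeeper.property.clientPort", Integer.toString(zkClientPort));
        // Key read by TableInputFormat to decide which table to scan.
        settings.put("hbase.mapreduce.inputtable", inputTable.trim());
        return settings;
    }

    public static void main(String[] args) {
        // Example values only; substitute the hosts and table of your own cluster.
        Map<String, String> settings =
                forCluster("zk1.example.com,zk2.example.com", 2181, "events");
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real job each entry would be applied with `conf.set(key, value)` on the Hadoop Configuration, and `TableMapReduceUtil.initTableMapperJob` normally sets the input table key for you. The EMR nodes also need network access to the ZooKeeper and region server hosts, for example through shared security groups.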

On Sun, Mar 4, 2012 at 10:55 PM, Mohit Gupta <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> Thank you so much. It has been a great help.
> As of now, I am exploring the idea of running an HBase cluster on EC2
> (EBS-backed) and using EMR to run the heavy ad-hoc jobs.
>
> I got confused by reading in a couple of places (especially this Amazon EMR
> forum thread, https://forums.aws.amazon.com/thread.jspa?messageID=238747, and the
> EMR documentation, which says in a number of places that 'The service runs
> job flows in Amazon EC2 and stores input and output data in Amazon S3
> and/or Amazon DynamoDB.') that HBase can't be used with EMR. But now,
> after going through your replies, I understand it this way: to use Hive
> on EMR, input and output need to be on S3 (or now DynamoDB as well), and
> to use other input/output sources (like an HBase cluster on EC2), I need to
> write a custom jar for every single job/query.
>
> Please let me know if I have got this right or am still missing something.
>
> Also, interesting idea of running a transient HBase alongside the normal
> cluster.
>
>
>
> On Sun, Mar 4, 2012 at 2:50 AM, Amandeep Khurana <[EMAIL PROTECTED]> wrote:
>
> > Mohit,
> >
> > Adding to what Andy and Vaibhav have listed - you'll need to ensure that
> > the Hadoop versions running in EMR and your HBase cluster are compatible if
> > you want to run MapReduce from EMR onto an external HBase cluster.
> >
> > If you choose to run HBase on your EMR cluster and don't want it to tear
> > down on job completion, start the cluster with the alive flag. However, the
> > moment the health of your master node goes bad (this does not happen very
> > often, but is not unheard of either; it's more common on EC2 than on
> > physical hardware), the EMR cluster will terminate. Read up on the semantics
> > of the alive flag and termination protection to understand the behavior
> > better.
> >
> > Another thing to be aware of while running HBase on EMR: you will most
> > likely be limited to keeping your HBase master and ZK on the node running
> > your Namenode and Jobtracker (aka EMR master). You can probably run multiple
> > masters and ZK nodes, or have separate nodes outside of the existing EMR
> > cluster, but you will need to do extra work (like adding nodes to the same
> > security groups and spinning up instances separately after the EMR cluster
> > is up).
> >
> > It comes down to specifying your requirements clearly and then figuring out
> > the right solution. :) You'll get plenty of help on the mailing list.
> >
> > Hope this helps.
> >
> > -Amandeep
> >
> > On Sat, Mar 3, 2012 at 7:12 PM, Vaibhav Puranik <[EMAIL PROTECTED]> wrote:
> >
> > > Mohit,
> > >
> > > I have written the blogpost.
> > >
> > > EMR is nothing but MapReduce. HBase provides TableInputFormat. With the
> > > TableInputFormat and TableMapReduceUtil
> > > (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableMapReduceUtil.html)
> > > classes, you can specify your source as HBase, hosted anywhere as long as
> > > it's accessible through the internet. In doing so, if HBase is not hosted
> > > on the same Hadoop cluster (which it won't be in the case of an EMR job),
> > > you will be sacrificing data locality (we are okay with that).
> > >
> > > Regards,
> > > Vaibhav
> > >
> > > On Sat, Mar 3, 2012 at 9:21 AM, Andrew Purtell <[EMAIL PROTECTED]> wrote:
> > >
> > > > I think there are a couple of things conflated here. Let me make four
> > > > brief points and then feel free to follow up where you would like more
> > > > information.
> > > >
> > > > 1) Many run HBase (and self-hosted Hadoop) on EC2. These clusters have
> > > > their own HDFS on EBS or instance store volumes.
> > > >
> > > > 2) You cannot run HBase backed by S3. Search for other HBase user list
> > > > emails on the subject. But this of course does not mean you cannot run
> > > > HBase on EC2. (See point 1.)