We run a multi-AZ RDS instance hosting our metastore, which is shared by multiple EMR clusters. We use RDS's backup/snapshot feature, although we haven't yet needed to restore from a backup for real (knock on wood).
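For anyone wanting to try the same setup, here is a minimal sketch of how a cluster's hive-site.xml could be pointed at an external metastore. It assumes the RDS instance runs MySQL. The host, database name, and credentials are hypothetical placeholders, and the helper name metastore_site_xml is made up for illustration. The javax.jdo.option.* property names are the standard Hive metastore connection settings.

```python
from xml.etree import ElementTree as ET


def metastore_site_xml(host, database, user, password, port=3306):
    """Render a hive-site.xml fragment for an external MySQL metastore.

    All argument values are supplied by the caller; nothing here is
    specific to any real deployment (assumption: MySQL-backed RDS).
    """
    props = {
        "javax.jdo.option.ConnectionURL": f"jdbc:mysql://{host}:{port}/{database}",
        "javax.jdo.option.ConnectionDriverName": "com.mysql.jdbc.Driver",
        "javax.jdo.option.ConnectionUserName": user,
        "javax.jdo.option.ConnectionPassword": password,
    }
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical endpoint and credentials, for illustration only.
    print(metastore_site_xml("metastore.example.internal", "hive_metastore", "hive", "change-me"))
```

Because every cluster reads the same connection settings, table definitions survive any single cluster being torn down.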
From: Sam Wilson [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, March 06, 2012 7:15 PM
To: [EMAIL PROTECTED]
Subject: Re: Amazon EMR Best Practices for Hive metastore
We also do #4. Initially we had lots of conversations about all the other options and whether we should do this or that. Ultimately we focused on going live as quickly as possible and getting more involved in the setup later.
Since then the only thing we've needed to do is hack a few of the baseline scripts EMR uses to launch Hive so that it gets more heap. We definitely have a few pain points around partition recovery, but those are inherent to Hive, not EMR.
I should note that we don't trust our EMR cluster to stick around, so we design for it to just die. You can't treat it like a regular Hadoop cluster. We made launching a new one an easy process and have decoupled Hive from the UX so that it's fully asynchronous.
So far, big wins and no complaints.
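The "script all the create external table statements" approach (option 4 in the original question quoted below) can be sketched roughly as follows. This is an illustration only, not our actual script: the table name, columns, and S3 location are invented, and create_external_table is a hypothetical helper. The trailing RECOVER PARTITIONS statement is the EMR Hive extension for picking up partitions already in S3; on stock Apache Hive the closest equivalent would be MSCK REPAIR TABLE.

```python
def create_external_table(name, columns, location, partitions=None):
    """Build the HiveQL to (re)create one external table.

    columns and partitions are lists of (column_name, hive_type) pairs.
    IF NOT EXISTS keeps the script safe to rerun on every new cluster.
    """
    cols = ",\n  ".join(f"{col} {typ}" for col, typ in columns)
    stmt = f"CREATE EXTERNAL TABLE IF NOT EXISTS {name} (\n  {cols}\n)"
    if partitions:
        parts = ", ".join(f"{col} {typ}" for col, typ in partitions)
        stmt += f"\nPARTITIONED BY ({parts})"
    stmt += f"\nLOCATION '{location}';"
    if partitions:
        # EMR-specific: register partitions that already exist under LOCATION.
        stmt += f"\nALTER TABLE {name} RECOVER PARTITIONS;"
    return stmt


if __name__ == "__main__":
    # Hypothetical table definition, for illustration only.
    print(create_external_table(
        "page_views",
        [("user_id", "STRING"), ("url", "STRING")],
        "s3://example-bucket/page_views/",
        partitions=[("dt", "STRING")],
    ))
```

Generating the statements from a list of table definitions kept in source control means a fresh cluster can rebuild its whole schema in its first step, which is what makes it safe to let a cluster die.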
On Mar 6, 2012, at 10:02 PM, Jeff Sternberg <[EMAIL PROTECTED]> wrote:
> We do 4), basically. We have a simple hive script that does all the "create external table" statements, and we run that script as step 1 of the EMR jobs we spin up. Then our "real" processing takes over in step 2 and beyond. We're only working with about 50 tables, so it's pretty manageable. A side benefit is that we can put this create-table script under source control to track our schema changes over time.
> Jeff Sternberg
> S&P Capital IQ
> -----Original Message-----
> From: Mark Grover [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, March 06, 2012 9:54 PM
> To: [EMAIL PROTECTED]
> Cc: Baiju Devani; Denys Berestyuk
> Subject: Amazon EMR Best Practices for Hive metastore
> Hi all,
> I am trying to get an idea of what people do for setting up Hive metastore when using Amazon EMR.
> For those of you using Amazon EMR:
> 1) Do you have a dedicated RDS instance external to your EMR Hive+Hadoop cluster that you use as a persistent metastore for all your cluster instantiations?
> 2) Do you use the MySQL DB that comes pre-installed on the master node and export its data (on cluster tear down) to something like S3 and import it from S3 during cluster bring up?
> 3) Do you use a local installation of Hive (instead of the one on EMR) so that you can make use of an in-house dedicated metastore while still using the Hadoop cluster on EMR? (i.e. local Hive + EMR Hadoop)
> 4) Do you do something really simple and naive like scripting up all your "create external table" commands and running them every time you bring up a cluster?
> Or, do you do something else not mentioned above? :-)
> Thank you in advance for sharing!
> Mark Grover, Business Intelligence Analyst OANDA Corporation
> www: oanda.com www: fxtrade.com