Re: Doubts related to Amazon EMR
Hi Bhavesh,
If you copy your jar over to the master node of your EMR cluster and install Sqoop as Kyle suggested, you can run your jar on the master node, just like you did on your local cluster before. Just make sure that the Hive JDBC drivers are available to your jar and that you connect to the Hive server on localhost on the appropriate port.

Long term, you might not want to run your Hive JDBC code on the master node of the cluster (to avoid burdening it). In that case, you can run your JDBC code on your local machine and have it connect remotely to a Hive server running on EMR. You might then have to separate out your Sqoop commands so they still run on the master node of the cluster, but you can deal with that when you get to it :-)
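As a minimal sketch of the localhost setup described above (the port is Hive's default of 10000; the commands and URL are illustrative, not taken from this thread):

```shell
# On the EMR master node, the Hive server of that era could be started with:
#   hive --service hiveserver -p 10000 &
# Code running on the master node would then connect with a JDBC URL like
# this (driver class org.apache.hadoop.hive.jdbc.HiveDriver):
HIVE_JDBC_URL="jdbc:hive://localhost:10000/default"
echo "$HIVE_JDBC_URL"
```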
Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.
----- Original Message -----
From: "Bhavesh Shah" <[EMAIL PROTECTED]>
Sent: Tuesday, April 24, 2012 1:04:01 AM
Subject: Re: Doubts related to Amazon EMR
Thanks to all for your answers.
But I want to ask one more thing:
1) I have written a program (my task) which contains Hive JDBC code and Sqoop commands for importing and exporting the tables.
If I create a JAR of my program and put it on EMR, do I need to do something extra, like writing mappers/reducers, to execute the program?
Or can I simply create the JAR and run it?

Bhavesh Shah

On Tue, Apr 24, 2012 at 7:20 AM, Mark Grover < [EMAIL PROTECTED] > wrote:
Hi Bhavesh,

To answer your questions:

1) S3 terminology uses the word "object" and I am sure they have good reasons as to why, but for us Hive'ers, an S3 object is the same as a file stored on S3. The complete path to the file is what Amazon calls the S3 "key", and the corresponding value is the contents of the file. For example, s3://my_bucket/tables/log.txt would be the key, and the actual content of the file would be the S3 object. You can use the AWS web console to create a bucket and use tools like s3cmd ( http://s3tools.org/s3cmd ) to put data onto S3.
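For example, a sketch of uploading a file with s3cmd (the bucket and file names are hypothetical, matching the example key above; a working s3cmd configured with your AWS credentials is assumed):

```shell
# Bucket and file names below are hypothetical.
BUCKET=my_bucket
LOCAL_FILE=log.txt
# The actual upload (needs s3cmd configured first via `s3cmd --configure`):
#   s3cmd put "$LOCAL_FILE" "s3://$BUCKET/tables/$LOCAL_FILE"
# The S3 key the object ends up under:
S3_KEY="s3://$BUCKET/tables/$LOCAL_FILE"
echo "$S3_KEY"
```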

However, like Kyle said, you don't necessarily need to use S3. S3 is typically only used when you want persistent storage of data. Most people store their input logs/files on S3 for Hive processing and also store the final aggregations and results on S3 for future retrieval. If you are just temporarily loading some data into Hive, processing it, and exporting it out, you don't have to worry about S3. The nodes that form your cluster have ephemeral storage that forms the HDFS; you can just use that. The only side effect is that you will lose all your data in HDFS once you terminate the cluster. If that's OK, don't worry about S3.

EMR instances are basically EC2 instances with some additional setup done on them. Transferring data between EC2 and EMR instances should be simple, I'd think. If your data is present in EBS volumes, you could look into adding an EMR bootstrap action that mounts that same EBS volume onto your EMR instances. It might be easier if you can do it without all the fancy mounting business though.
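A very rough sketch of what such a bootstrap action might do on the master node (the volume id, device, and mount point are all hypothetical; it assumes the EC2 API tools and credentials are available on the node):

```shell
# All identifiers below are hypothetical.
VOLUME_ID=vol-12345678
DEVICE=/dev/sdf
MOUNT_POINT=/mnt/ebs-data
# Attach the volume to this instance and mount it (left commented out,
# since it needs the EC2 API tools and credentials on the node):
#   INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
#   ec2-attach-volume "$VOLUME_ID" -i "$INSTANCE_ID" -d "$DEVICE"
#   sudo mkdir -p "$MOUNT_POINT"
#   sudo mount "$DEVICE" "$MOUNT_POINT"
echo "$VOLUME_ID $DEVICE $MOUNT_POINT"
```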

Also, keep in mind that there might be costs for data transfers across Amazon data centers, so you would want to keep your S3 buckets, EMR cluster, and EC2 instances in the same region if at all possible. Within the same region, there shouldn't be any extra transfer costs.

2) Yeah, EMR supports custom jars. You can specify them at the time you create your cluster. This should require minimal porting changes to your jar itself, since the Hadoop and Hive versions on EMR are the same as (well, close enough to) what you installed on your local cluster.
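A hedged sketch of specifying a custom jar with the elastic-mapreduce command-line client (the Ruby CLI Amazon provided at the time); the bucket, jar name, and main class are hypothetical:

```shell
# All names below are hypothetical.
JAR_PATH="s3://my_bucket/jars/my-task.jar"
MAIN_CLASS="com.example.MyTask"
# Creating a job flow with a custom jar step (left commented out, since
# it needs the elastic-mapreduce CLI configured with your credentials):
#   elastic-mapreduce --create --name "my-task" \
#     --jar "$JAR_PATH" --main-class "$MAIN_CLASS"
echo "$JAR_PATH $MAIN_CLASS"
```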

3) Like Kyle said, Sqoop with EMR should be OK.

Good luck!
Mark Grover, Business Intelligence Analyst
OANDA Corporation


From: "Kyle Mulka" < [EMAIL PROTECTED] >
Sent: Monday, April 23, 2012 10:55:36 AM
Subject: Re: Doubts related to Amazon EMR
It is possible to install Sqoop on AWS EMR. I've got some scripts I can publish later. You are not required to use S3 to store files and can use the local (temporary) HDFS instead. After you have Sqoop installed, you can import your data with it into HDFS, run your calculations in HDFS, then export your data back out using Sqoop again.
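The round trip Kyle describes might look roughly like this (the connection string, credentials, table names, and HDFS paths are all hypothetical; it assumes a MySQL database reachable from the cluster):

```shell
# All connection details below are hypothetical.
DB_URL="jdbc:mysql://my-ec2-host/mydb"
# Import a table from the database into HDFS (left commented out, since
# it needs a reachable database and Sqoop on the path):
#   sqoop import --connect "$DB_URL" --username myuser -P \
#     --table input_table --target-dir /user/hive/input_table
# ... run your Hive processing on the imported data ...
# Then export the results back out of HDFS into the database:
#   sqoop export --connect "$DB_URL" --username myuser -P \
#     --table result_table --export-dir /user/hive/result_table
echo "$DB_URL"
```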

Kyle Mulka

On Apr 23, 2012, at 8:42 AM, Bhavesh Shah < [EMAIL PROTECTED] > wrote:
Hello all,
I want to deploy my task on Amazon EMR, but as I am new to Amazon Web Services I am confused about some of the concepts.

My Use Case:

I want to import a large amount of data from EC2 into Hive through Sqoop. The imported data will be processed in Hive by applying some algorithm, generating a result (in table form, in Hive only). The generated result will then be exported back to EC2, again through Sqoop only.

I am new to Amazon Web Services and want to implement this use case with the help of AWS EMR. I have already implemented it on a local machine.

I have read some links related to AWS EMR about launching an instance, what EMR is, how it works, etc. I have some doubts about EMR:
1) EMR uses S3 buckets, which hold the input and output data for Hadoop processing (in the form of objects). ---> I didn't get how to store the data in the form of objects on S3 (my data will be files).

2) As already said, I have implemented a task for my use case in Java. So if I create the JAR of my program and create the job flow with a custom JAR, will it be possible to run it like this, or do I need to do something extra for that?

3) As I said