Re: My notes for running Pig from EC2 to EMR
Aniket Mokashi 2011-12-17, 00:11
Amazon supports pig 0.9.1 now. Take a look-
Also, I am not very sure about copying EMR jars to EC2. You should check
that with Amazon.
On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[EMAIL PROTECTED]> wrote:
> This might get outdated quickly as EMR upgrades its Pig version, and Pig
> 0.9.1 is being used by everyone anyway. But here is my write-up for your
> reference.
> The main obstacles for running Pig on Elastic MapReduce (EMR) are:
> * The Pig version installed on EMR is older than 0.8.1. (By some
> accounts EMR just upgraded its Pig version to 0.9.1.)
> * The Hadoop version on EMR might not match the one Pig is using.
> * The user you're running Pig as might not have permissions on
> HDFS on the EMR cluster.
> How to solve each one of these issues:
> 1. We will not use the Pig that is installed on EMR. We will use
> an EC2 instance as the Pig client, which compiles the Pig scripts and
> submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop
> version that Pig is using and what's installed on EMR must match (or at
> least be backward compatible), i.e. the EMR Hadoop version should be >=
> Pig's Hadoop version.
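> A quick sanity check is to compare versions on both machines (plain
> hadoop commands; nothing here is specific to my setup):
>     # on the EC2 Pig client
>     $HADOOP_HOME/bin/hadoop version
>     # on the EMR master node
>     hadoop version
> The client's Hadoop must not be newer than the EMR one.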
> 2. The best way to do this is to copy the Hadoop directory
> from one of the EMR instances to the Pig client EC2 machine. The next
> problem is to make Pig use this Hadoop rather than the one it has been
> using. For Pig 0.8.1 or earlier, the Pig jar has the Hadoop classes
> bundled within it, so any attempt at making Pig use the jars downloaded
> from EMR fails. The solution was to use Pig 0.9.1, which has a
> pigwithouthadoop.jar. When you use this, it will use whichever Hadoop you
> make HADOOP_HOME point to, which in this case will be the directory where
> you downloaded the EMR classes and jars.
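> A rough sketch of that copy (the key file, IP, and target directory are
> placeholders; on the EMR AMIs Hadoop typically lives under /home/hadoop):
>     # run on the EC2 Pig client
>     scp -r -i ~/emr-key.pem hadoop@10.xxx.xxx.xxx:/home/hadoop /opt/emr-hadoop
>     export HADOOP_HOME=/opt/emr-hadoop
>     export PATH=$HADOOP_HOME/bin:$PATH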
> 3. Now that you are using Pig 0.9.1, your version might have a bug
> in the pig executable script (in <Pig install dir>/bin) where it does not
> respect HADOOP_HOME. So patch the script.
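> I won't reproduce the exact patch; the gist (an illustrative sketch, not
> the real bin/pig, which is more involved) is that the launcher must build
> its classpath from $HADOOP_HOME rather than from a bundled Hadoop:
>     # simplified idea of what bin/pig should end up doing
>     CLASSPATH=$PIG_HOME/pigwithouthadoop.jar:$HADOOP_HOME/conf
>     for jar in $HADOOP_HOME/hadoop-core-*.jar $HADOOP_HOME/lib/*.jar; do
>         CLASSPATH=$CLASSPATH:$jar
>     done
>     exec java -cp "$CLASSPATH" org.apache.pig.Main "$@"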
> 4. Now you want Pig to be using the JobTracker and NameNode of the
> EMR cluster you want the computation to run on. Follow one of the usual
> ways to do this:
> a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port> on the
> command line. The JT & NN IP will be the internal 10.xxx.xxx.xxx IP of the
> master EMR node; the ports are 9000 and 9001 for the NN & JT respectively.
> b. The pig.properties file in the conf dir.
> c. Change core-site.xml & mapred-site.xml in the local
> $HADOOP_HOME/conf dir.
> The precedence is a > b > c.
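> For example, option a looks something like this (10.xxx.xxx.xxx being
> your master's internal IP, and myscript.pig whatever you are running):
>     pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
>         -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
>         myscript.pig
> Or, for option b, the equivalent lines in conf/pig.properties:
>     fs.default.name=hdfs://10.xxx.xxx.xxx:9000
>     mapred.job.tracker=10.xxx.xxx.xxx:9001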
> 5. Now Pig will start, but it will fail if the user you are running
> Pig as does not match the default EMR user, which is hadoop. So this is
> what I do on the EMR cluster:
> hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -mkdir /user/piguser
> hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
> You can argue that 777 is too generous, but I don't care, as it's only
> temporary files that are stored there and they are gone once my instance
> is gone. All my real data is on S3.
> Now you should be all set.
> Only steps 4 & 5 need to be done every time you start your new EMR cluster.