Re: My notes for running Pig from EC2 to EMR
Amazon supports Pig 0.9.1 now. Take a look:
http://aws.amazon.com/releasenotes/Elastic-MapReduce/1044996466833146

Also, I am not very sure about copying EMR jars to EC2. You should check
that with Amazon.

Thanks,
Aniket

On Fri, Dec 16, 2011 at 12:02 PM, Ayon Sinha <[EMAIL PROTECTED]> wrote:

> This might get outdated quickly as EMR upgrades its Pig version, and Pig
> 0.9.1 is being used by everyone anyway. But here is my write-up for your
> review:
>
> The main obstacles for running Pig on Elastic MapReduce (EMR) are:
>
>        * The Pig version installed on EMR is older than 0.8.1. (By some
> accounts EMR has just upgraded its Pig version to 0.9.1.)
>        * The Hadoop version on EMR might not match the one Pig is using.
>        * The user you’re running Pig as might not have permissions on the
> HDFS of the EMR cluster.
>
> How to solve each one of these issues:
>        1. We will not be using the Pig that is installed on EMR. We will use
> an EC2 instance as the Pig client, which compiles the Pig scripts and
> submits MapReduce jobs to the Hadoop on EMR. For this to work, the Hadoop
> version that Pig is using and the one installed on EMR must match (or at
> least be backward compatible), i.e. the EMR Hadoop version should be >= Pig’s
> Hadoop version.
>        2. The best way to do this is to copy over the Hadoop directory
> from one of the EMR instances to the Pig client EC2 machine. The next
> problem is to make Pig use this Hadoop rather than the one it has been using.
> For Pig 0.8.1 or earlier, the Pig jar has Hadoop classes bundled within, so
> any attempt at making Pig use the jars downloaded from EMR fails. The
> solution was to use Pig 0.9.1, which ships a pigwithouthadoop.jar. With that,
> Pig uses whichever Hadoop you make HADOOP_HOME point to, which in this case
> will be the directory where you downloaded the EMR classes and configs (see
> the sketch after this write-up).
>        3. Now that you are using Pig 0.9.1, your version might have a bug
> in the pig executable script (in <Pig install dir>/bin) where it does not
> respect HADOOP_HOME. So patch the script (sketch below).
>        4. Now you want Pig to use the JobTracker and NameNode of the
> EMR cluster you want the computation to run on. Follow one of the usual ways
> to do this (example after this write-up):
>        a. -Dmapred.job.tracker=<jt:port> -Dfs.default.name=<nn:port>. The
> JT & NN IP will be the internal 10.xxx.xxx.xxx IP of the master EMR node;
> the ports are 9000 and 9001 for the NN & JT respectively.
>        b. pig.properties file in the conf dir.
>        c. change core-site.xml & mapred-site.xml in the local
> $HADOOP_HOME/conf dir.
>
> The precedence is a > b > c
>        5. Now Pig will start, but it will fail if the user you are running
> Pig as does not match the default EMR user, which is hadoop. So this is what
> I do on the EMR cluster:
>        a. hadoop dfs -fs hdfs://<EMR internal ip 10.xxx.xxx.xxx>:9000
> -mkdir /user/piguser; hadoop dfs -fs hdfs://<EMR internal ip
> 10.xxx.xxx.xxx>:9000 -chmod -R 777 /
>        b. You can argue that 777 is too generous, but I don't care, as it is
> only temporary files that are stored there and they are gone once my instance
> is gone. All my real data is on S3.
> Now you should be all set.
> Only steps 4 & 5 need to be done every time you start a new EMR cluster.
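>
> A rough sketch of what step 2 can look like (the /home/hadoop path, the key
> file and the script name below are only placeholder assumptions; adjust them
> for your own setup):
>
>         # copy the Hadoop tree and its configs off the EMR master node
>         scp -r -i my-emr-key.pem hadoop@<emr-master>:/home/hadoop ~/emr-hadoop
>
>         # point Pig 0.9.1 (pigwithouthadoop.jar) at that Hadoop and its configs
>         export HADOOP_HOME=~/emr-hadoop
>         export PIG_CLASSPATH=$HADOOP_HOME/conf
>         pig myscript.pig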
>
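> For step 3, the exact patch depends on your Pig 0.9.1 build; the idea is just
> to make bin/pig keep a HADOOP_HOME you have already exported instead of
> overriding it with its own default, roughly along these lines:
>
>         # in bin/pig: only fall back to a default when HADOOP_HOME is unset
>         if [ -z "$HADOOP_HOME" ]; then
>             HADOOP_HOME=/usr/lib/hadoop
>         fi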
>
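> And for step 4, what option (a) and option (b) might look like, using the
> ports above (the 10.xxx address is whatever internal IP your EMR master
> reports, and myscript.pig is just a placeholder):
>
>         # option (a): pass the EMR JobTracker and NameNode on the command line
>         pig -Dmapred.job.tracker=10.xxx.xxx.xxx:9001 \
>             -Dfs.default.name=hdfs://10.xxx.xxx.xxx:9000 \
>             myscript.pig
>
>         # option (b): the same two properties in conf/pig.properties
>         mapred.job.tracker=10.xxx.xxx.xxx:9001
>         fs.default.name=hdfs://10.xxx.xxx.xxx:9000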
>
>  -Ayon
>

--
"...:::Aniket:::... Quetzalco@tl"