Ayon Sinha
2011-12-02, 00:12
Jonathan Coveney
2011-12-02, 00:17
Ayon Sinha
2011-12-02, 00:27
Daniel Dai
2011-12-02, 09:06
Ayon Sinha
2011-12-02, 16:15
Ayon Sinha
2011-12-03, 04:01
Ayon Sinha
2011-12-05, 19:08
Thejas Nair
2011-12-05, 20:17
-
Trying to submit Pig job to Amazon EMR
Ayon Sinha 2011-12-02, 00:12
Hi,
I have an EC2 box set up with Pig 0.8.1, which runs my jobs fine in local mode. Now I want to configure the NN & JT so that the job goes to the EMR cluster I've spun up.
I have a local pigconf directory with the Hadoop XML files, and I have pointed HADOOP_CONF_DIR and PIG_CLASSPATH at it.

In core-site.xml I have:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://10.116.83.74:9000</value>
    </property>

In mapred-site.xml I have:

    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>10.116.83.74:9001</value>
      </property>
    </configuration>

Now Pig tries to connect and I get:

    2011-12-01 16:10:58,009 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pigconf/pig_1322784657959.log
    2011-12-01 16:10:58,950 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://10.116.83.74:9000
    2011-12-01 16:10:59,814 [main] ERROR org.apache.pig.Main - ERROR 2999: Unexpected internal error. Failed to create DataStorage

The log file says:

    Error before Pig is launched
    ----------------------------
    ERROR 2999: Unexpected internal error. Failed to create DataStorage

    java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:226)
        at org.apache.pig.PigServer.<init>(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
        at org.apache.hadoop.ipc.Client.call(Client.java:1110)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.getProtocolVersion(Unknown Source)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:398)
        at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:384)
        at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:111)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:213)
        at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:180)
        at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:89)
        at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1514)
        at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:67)
        at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:1548)
        at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1530)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:228)
        at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:111)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72)
        ... 9 more
    Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:815)
        at org.apache.hadoop.ipc.Client$Connection.run(Client.java:724)
    ===============================================================================

My EMR cluster is running Hive jobs just fine, so if I can get it to run my Pig jobs, I'll be happy.
-Ayon
-
Re: Trying to submit Pig job to Amazon EMR
Jonathan Coveney 2011-12-02, 00:17
Usually this means that the version of Hadoop bundled in Pig does not match the version of Hadoop you're running. I'd do ant jar-withouthadoop and point it at the Hadoop on EC2, using the hadoopless Pig jar.
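A quick way to test the version-mismatch theory is to compare what `hadoop version` prints on the client box and on the cluster. The sketch below only parses and compares the first line of that output; the sample version strings are made up for illustration, and differing version strings are a hint rather than proof of an RPC incompatibility.

```python
import re


def parse_hadoop_version(version_output: str) -> tuple:
    """Return the numeric version from the first line of `hadoop version` output.

    The first line is expected to look like "Hadoop 0.20.2".
    """
    lines = version_output.strip().splitlines()
    if not lines:
        raise ValueError("empty version output")
    match = re.match(r"Hadoop\s+(\d+(?:\.\d+)*)", lines[0])
    if match is None:
        raise ValueError(f"unrecognised version line: {lines[0]!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def same_hadoop_version(client_output: str, cluster_output: str) -> bool:
    """True when both outputs report the same numeric Hadoop version."""
    return parse_hadoop_version(client_output) == parse_hadoop_version(cluster_output)


if __name__ == "__main__":
    # Hypothetical outputs, not taken from the thread.
    client = "Hadoop 0.20.2\n(build details)"
    cluster = "Hadoop 0.20.205\n(build details)"
    print(same_hadoop_version(client, cluster))
```

If the two differ, rebuilding Pig without its bundled Hadoop classes, as suggested above, is the natural next step.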
-
Re: Trying to submit Pig job to Amazon EMR
Ayon Sinha 2011-12-02, 00:27
Well, I should not need Pig to connect to HDFS. It should use S3, so I changed fs.default.name to s3n://<mybucketname> and now I get the Grunt prompt.

The next problem I'm facing is that when I say

    a = load 's3n://<mydatabucket>/blah/foo/day=20111127' using PigStorage();

I get

    2011-12-01 16:22:01,948 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user' - Unexpected response code 404, expected 200
    2011-12-01 16:22:02,024 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/user%2Fmymapred-user_%24folder%24' - Unexpected response code 404, expected 200
    2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Unexpected response code 404, expected 200
    2011-12-01 16:22:02,038 [main] WARN org.jets3t.service.impl.rest.httpclient.RestS3Service - Response '/' - Received error response with XML message
    2011-12-01 16:22:02,045 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 6007: Unable to check name s3n://<mybucketname/user/mymapred-user

What is it trying to check? Does it need some storage to write intermediate files to?
-Ayon
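The 404 warnings above can be decoded by hand. As a sketch, assuming the client resolves the user's home directory as /user/<username> on the default filesystem and that the S3 native filesystem probes both the plain key and a `_$folder$` marker key, the following reproduces the URL-encoded paths seen in the log. The function names are illustrative only.

```python
from urllib.parse import quote


def home_directory_uri(default_fs: str, username: str) -> str:
    """Join the default filesystem URI with the conventional /user/<name> path."""
    return default_fs.rstrip("/") + "/user/" + username


def s3_probe_paths(username: str) -> list:
    """Return the URL-encoded request paths for the home-directory probes.

    Two keys are tried: the plain key and the same key with the
    "_$folder$" directory-marker suffix.
    """
    home_key = "user/" + username
    candidate_keys = [home_key, home_key + "_$folder$"]
    # safe="" makes quote() encode "/" as %2F, matching the log output.
    return ["/" + quote(key, safe="") for key in candidate_keys]


if __name__ == "__main__":
    print(home_directory_uri("s3n://mybucketname", "mymapred-user"))
    for path in s3_probe_paths("mymapred-user"):
        print(path)
```

Read this way, the failing check is for a home directory inside the bucket named in fs.default.name, not for the data bucket used in the load statement.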
-
Re: Trying to submit Pig job to Amazon EMR
Daniel Dai 2011-12-02, 09:06
Pig should support this syntax. Is your S3 data shared publicly?
Otherwise, do you have fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey defined?
Daniel
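For reference, here is a minimal sketch of generating the credential properties mentioned above as a Hadoop-style configuration fragment. The key values are placeholders, and the helper name is made up. One assumption to check against your Hadoop version: for s3n:// URIs the property names are normally spelled with the s3n scheme (fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey) rather than fs.s3.

```python
import xml.etree.ElementTree as ET


def build_configuration(properties: dict) -> str:
    """Render name/value pairs as a Hadoop <configuration> XML document."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Placeholder credentials; substitute real values outside version control.
    fragment = build_configuration({
        "fs.s3n.awsAccessKeyId": "YOUR_ACCESS_KEY_ID",
        "fs.s3n.awsSecretAccessKey": "YOUR_SECRET_ACCESS_KEY",
    })
    print(fragment)
```

The resulting properties would go into the core-site.xml that HADOOP_CONF_DIR points at.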
-
Re: Trying to submit Pig job to Amazon EMR
Ayon Sinha 2011-12-02, 16:15
Yes, I do have the awsSecretAccessKey defined, and correctly, I believe.
To test:

    mashlogic@cruncher ~ [ 8:07AM] hadoop dfs -ls s3n://ml-weblogs/smartlinks/daytsvs/day=20111130/
    Found 29 items
    -rwxrwxrwx 1 139148530 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaa.tsv.gz
    -rwxrwxrwx 1 138086136 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xab.tsv.gz
    -rwxrwxrwx 1 146165298 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xac.tsv.gz
    -rwxrwxrwx 1 152491197 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xad.tsv.gz
    -rwxrwxrwx 1 154673351 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xae.tsv.gz
    -rwxrwxrwx 1 155920643 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xaf.tsv.gz
    -rwxrwxrwx 1 156468098 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xag.tsv.gz
    -rwxrwxrwx 1 157626894 2011-12-01 07:03 /smartlinks/daytsvs/day=20111130/xah.tsv.gz
    -rwxrwxrwx 1 158872953 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xai.tsv.gz
    -rwxrwxrwx 1 158108620 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xaj.tsv.gz
    -rwxrwxrwx 1 158439002 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xak.tsv.gz
    -rwxrwxrwx 1 158618811 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xal.tsv.gz
    -rwxrwxrwx 1 159421273 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xam.tsv.gz
    -rwxrwxrwx 1 158402981 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xan.tsv.gz
    -rwxrwxrwx 1 157375232 2011-12-01 07:04 /smartlinks/daytsvs/day=20111130/xao.tsv.gz
    -rwxrwxrwx 1 158516929 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xap.tsv.gz
    -rwxrwxrwx 1 158029022 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xaq.tsv.gz
    -rwxrwxrwx 1 159808270 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xar.tsv.gz
    -rwxrwxrwx 1 160148777 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xas.tsv.gz
    -rwxrwxrwx 1 160844640 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xat.tsv.gz
    -rwxrwxrwx 1 161679424 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xau.tsv.gz
    -rwxrwxrwx 1 159240120 2011-12-01 07:05 /smartlinks/daytsvs/day=20111130/xav.tsv.gz
    -rwxrwxrwx 1 160124996 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaw.tsv.gz
    -rwxrwxrwx 1 159158447 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xax.tsv.gz
    -rwxrwxrwx 1 158436630 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xay.tsv.gz
    -rwxrwxrwx 1 158518938 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xaz.tsv.gz
    -rwxrwxrwx 1 156520868 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xba.tsv.gz
    -rwxrwxrwx 1 154253795 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbb.tsv.gz
    -rwxrwxrwx 1 142244585 2011-12-01 07:06 /smartlinks/daytsvs/day=20111130/xbc.tsv.gz

Trying to run something as simple as

    a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130/' using PigStorage();
    s = sample a 0.001;
    dump s;

gives

    ERROR 2999: Unexpected internal error. Failed to create DataStorage

    java.lang.RuntimeException: Failed to create DataStorage
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75)
        at org.apache.pig.backend.hadoop.datastorage.HDataStorage.<init>(HDataStorage.java:58)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214)
        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134)
        at org.apache.pig.impl.PigContext.connect(PigContext.java:183)
        at org.apache.pig.PigServer.<init>(PigServer.java:226)
        at org.apache.pig.PigServer.<init>(PigServer.java:215)
        at org.apache.pig.tools.grunt.Grunt.<init>(Grunt.java:55)
        at org.apache.pig.Main.run(Main.java:452)
        at org.apache.pig.Main.main(Main.java:107)
    Caused by: java.io.IOException: Call to /10.116.83.74:9000 failed on local exception: java.io.EOFException
        at org.apache.hadoop.ipc.Client.wrapException(Client.java:1142)
        at org.apache.hadoop.ipc.Client.call(Client.java:1110)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226)
        at $Proxy0.getProtocolVersion(Unknown Source)

-Ayon
-
Re: Trying to submit Pig job to Amazon EMR
Ayon Sinha 2011-12-03, 04:01
So with the help of Daniel and Thejas, we figured out the problem. The root cause was a mismatch of Hadoop versions between EMR and the Pig client. Copying all the Hadoop jars from the EMR box to the EC2 box running the Pig 0.8.1 client did not resolve the issue by itself, because Pig 0.8.1 uses the Hadoop classes packaged inside its own jar. Version 0.9 has a pigwithouthadoop jar, so we used that.

Also, the bin/pig script has a bug that resets HADOOP_HOME. We patched the script to fix this.

Finally, Pig looks for a /user/<username> directory in the HDFS of the EMR cluster. One way around this is to create the directory in HDFS and then let Pig do its job. I'm not sure why Pig can't create that directory if it doesn't exist. I will investigate that.

Thanks to Daniel & Thejas once again.
-Ayon
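The last step described above, creating the /user/<username> directory in the cluster's HDFS, can be sketched as follows. This only builds the `hadoop fs -mkdir` command line; the helper name and the example user are illustrative, and actually running the command assumes a `hadoop` binary on the PATH that is configured for the EMR cluster.

```python
import subprocess


def mkdir_home_command(username: str) -> list:
    """Build the `hadoop fs -mkdir` command for the user's HDFS home directory."""
    if not username or "/" in username:
        raise ValueError(f"invalid username: {username!r}")
    return ["hadoop", "fs", "-mkdir", "/user/" + username]


def create_home_directory(username: str) -> None:
    """Run the command; raises CalledProcessError if hadoop reports a failure."""
    subprocess.run(mkdir_home_command(username), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch is safe to try.
    print(" ".join(mkdir_home_command("mymapred-user")))
```

Call create_home_directory only from a box whose Hadoop configuration already points at the cluster, since the directory is created on whatever filesystem fs.default.name names.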
-
Re: Trying to submit Pig job to Amazon EMR
Ayon Sinha 2011-12-05, 19:08
Looks like I'm running into a problem I hadn't seen before.
Pig is 0.9.1. Hadoop is the same version as on EMR. The conf is being picked up, so Pig connects to the EMR NN and JT. Now I get this:

    /home/mashlogic/ayon/hadoop-0.20.0
    2011-12-05 10:56:58,200 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pig_1323111418198.log
    2011-12-05 10:56:58,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: 10.203.6.84:9000
    2011-12-05 10:56:58,402 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
    2011-12-05 10:56:58,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.203.6.84:9001
    2011-12-05 10:56:58,532 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
    grunt> a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130' using PigStorage();
    2011-12-05 10:57:18,078 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
    <line 1, column 4> pig script failed to validate: java.net.URISyntaxException: Illegal character in scheme name at index 0: 10.203.6.84:9000

What is going on here?
-Ayon
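The thread does not confirm a cause for this error, but the deprecation warnings in the log suggest one plausible reading: fs.default.name is set to a bare host:port, which is not a valid URI because a scheme name cannot start with a digit. As a sketch under that assumption, the helper below checks for a scheme and adds hdfs:// when it is missing, producing the form the warning itself recommends. The function names are made up for illustration.

```python
import re

# A URI scheme starts with a letter, followed by letters, digits, "+", "." or "-".
_SCHEME_PREFIX = re.compile(r"^[A-Za-z][A-Za-z0-9+.\-]*://")


def has_scheme(fs_name: str) -> bool:
    """True when the filesystem name begins with a syntactically valid scheme."""
    return _SCHEME_PREFIX.match(fs_name) is not None


def qualify_fs_name(fs_name: str, default_scheme: str = "hdfs") -> str:
    """Return fs_name unchanged if it has a scheme, else prefix default_scheme."""
    if has_scheme(fs_name):
        return fs_name
    return f"{default_scheme}://{fs_name.rstrip('/')}/"


if __name__ == "__main__":
    print(qualify_fs_name("10.203.6.84:9000"))
    print(qualify_fs_name("s3n://ml-weblogs"))
```

If this reading is right, writing the fully qualified hdfs:// form into core-site.xml would be the thing to try first.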
-
Re: Trying to submit Pig job to Amazon EMR
Thejas Nair 2011-12-05, 20:17
Can you send the entire stack trace from the Pig logs?
-Thejas

On 12/5/11 11:08 AM, Ayon Sinha wrote:
> Looks like I'm running into a problem I hadn't seen before.
> Pig is 0.9.1. Hadoop is the same version as on EMR. The conf is being
> picked up so that it connects to the EMR NN and JT. Now I get this:
>
> /home/mashlogic/ayon/hadoop-0.20.0
> 2011-12-05 10:56:58,200 [main] INFO org.apache.pig.Main - Logging error messages to: /home/mashlogic/ayon/pig_1323111418198.log
> 2011-12-05 10:56:58,398 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: 10.203.6.84:9000
> 2011-12-05 10:56:58,402 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
> 2011-12-05 10:56:58,531 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to map-reduce job tracker at: 10.203.6.84:9001
> 2011-12-05 10:56:58,532 [main] WARN org.apache.hadoop.fs.FileSystem - "10.203.6.84:9000" is a deprecated filesystem name. Use "hdfs://10.203.6.84:9000/" instead.
> grunt> a = load 's3n://ml-weblogs/smartlinks/daytsvs/day=20111130' using PigStorage();
> 2011-12-05 10:57:18,078 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse:
> <line 1, column 4> pig script failed to validate: java.net.URISyntaxException: Illegal character in scheme name at index 0: 10.203.6.84:9000
>
> What is going on here?
> -Ayon
> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
> Also check out my Blog for answers to commonly asked questions. <http://dailyadvisor.blogspot.com>
>
> ------------------------------------------------------------------------
> From: Ayon Sinha <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: Friday, December 2, 2011 8:01 PM
> Subject: Re: Trying to submit Pig job to Amazon EMR
>
> So with the help of Daniel and Thejas, we figured out the problem.
> The root cause was the mismatch of Hadoop versions between EMR and the Pig
> client. When I copied over all the Hadoop jars from the EMR box to the
> EC2 Pig 0.8.1 client box, it still did not resolve the issue. The
> root cause of that was that Pig 0.8.1 uses Hadoop classes from within
> its own packaged jar. Version 0.9 has a pigwithouthadoop jar, so we used that.
>
> Also, the bin/pig script has a bug that resets HADOOP_HOME. The script
> was also patched to fix this.
>
> Pig will then also look for a /user/<username> directory in the HDFS of
> the EMR cluster. So one way is to create the directory in HDFS and
> then let Pig do its job. I'm not sure why Pig can't create that
> directory if it doesn't exist. Will investigate that.
>
> Thanks to Daniel & Thejas once again.
>
> -Ayon
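The WARN lines in the quoted log say that "10.203.6.84:9000" is a deprecated filesystem name and suggest "hdfs://10.203.6.84:9000/", and the URISyntaxException complains about the same string. One plausible reading, not confirmed anywhere in this thread, is that fs.default.name in the client's core-site.xml is missing the hdfs:// prefix. A small sketch to check for that follows; the sample XML and the function names (`has_scheme`, `check_fs_default_name`) are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# RFC 3986 scheme: a letter followed by letters, digits, "+", "-" or ".",
# here required to be followed by "://".
SCHEME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9+.-]*://")


def has_scheme(value):
    """True if the value starts with a URI scheme such as hdfs://."""
    return SCHEME_RE.match(value) is not None


def check_fs_default_name(core_site_xml):
    """Return (value, ok) for fs.default.name in a core-site.xml string."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        if prop.findtext("name") == "fs.default.name":
            value = (prop.findtext("value") or "").strip()
            return value, has_scheme(value)
    return None, False


# Hypothetical config reproducing the value seen in the log above.
SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>10.203.6.84:9000</value>
  </property>
</configuration>"""

print(check_fs_default_name(SAMPLE))
```

If the check reports False for the real conf file, changing the value to the form the WARN message recommends would be the first thing to try.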