-
using s3 as a data source
Dave Viner 2010-06-14, 02:36

I'm having trouble using S3 as a data source for files in the LOAD statement. From my research, it appears that I want s3n://, not s3://, because the file was placed there by another (non-Hadoop/Pig) process. Here's the basic step:

    LOGS = LOAD 's3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2' USING PigStorage('\t');
    dump LOGS;

I get this grunt error:

    org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2

Is there some other way I can/should specify a file from S3 as the source of a LOAD statement?

Thanks
Dave Viner
-
Re: using s3 as a data source
Ashutosh Chauhan 2010-06-14, 06:09

Dave,

There should be a log file in the directory from which you are running Pig. It will contain the stack trace for the failure. Can you paste the contents of the log file here?

Ashutosh
-
Re: using s3 as a data source
Dave Viner 2010-06-14, 14:00

Here's the stack trace related to that error:

Pig Stack Trace
---------------
ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2

org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias LOGS
    at org.apache.pig.PigServer.openIterator(PigServer.java:521)
    at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:544)
    at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:241)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:162)
    at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:138)
    at org.apache.pig.tools.grunt.Grunt.run(Grunt.java:75)
    at org.apache.pig.Main.main(Main.java:357)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backend error: org.apache.pig.backend.executionengine.ExecException: ERROR 2118: Unable to create input splits for: s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:169)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:268)
    at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:308)
    at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:835)
    at org.apache.pig.PigServer.store(PigServer.java:569)
    at org.apache.pig.PigServer.openIterator(PigServer.java:504)
    ... 6 more

After much more experimentation, I discovered that if I copy the file locally before executing Pig, the script works properly. That is, I ran:

    % /usr/local/hadoop/bin/hadoop dfs -copyToLocal "s3n:///log/file/path/2010-04-13-20-05-04.log.bz2" test.bz2

Then in pig, read in the file using:

    logstest2 = load 'test.bz2' USING PigStorage('\t');

and it worked fine.

One additional problem I discovered, at least for hdfs, is that dfs -copyToLocal does not work for a file with a ':' in the name. When I replaced the ':' with '-', it worked fine. However, even using the '-' filename, Pig would not open the remote file.

Dave Viner
-
Re: using s3 as a data source
jr 2010-06-14, 14:28

I think I ran into the same kind of error: having Pig load directly from s3n didn't work. I've switched to running a distcp from s3n to HDFS prior to the Pig job and then loading the data from HDFS.

Johannes
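
For readers who want to try the copy-first workaround described above, here is a minimal sketch. It assumes `hadoop` is on the PATH (or that HADOOP points at it) and that S3 credentials are already configured for the s3n filesystem; the function name `stage_to_hdfs`, the bucket `my-bucket`, and the paths are hypothetical placeholders, not values from this thread.

```shell
#!/usr/bin/env bash
# Sketch of the copy-first workaround; bucket and paths below are placeholders.
# HADOOP can be overridden to point at a specific install,
# e.g. HADOOP=/usr/local/hadoop/bin/hadoop
HADOOP="${HADOOP:-hadoop}"

# Copy a file or directory from S3 (s3n) into HDFS with distcp.
# (stage_to_hdfs is a hypothetical helper name, not a Hadoop command.)
stage_to_hdfs() {
  local src="$1" dst="$2"
  "$HADOOP" distcp "$src" "$dst"
}

# Example (needs a running cluster and S3 credentials in the Hadoop config):
#   stage_to_hdfs 's3n://my-bucket/log/file/path/' '/user/me/logs/'
# Then, in Pig, load from HDFS instead of S3:
#   LOGS = LOAD '/user/me/logs/' USING PigStorage('\t');
```

Keeping the copy as a separate step means the Pig script only ever sees HDFS paths, which is the part the posters above report as working.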
-
Re: using s3 as a data source
Dan Di Spaltro 2010-06-14, 15:39

Aren't you missing the bucket name?

--
Dan Di Spaltro
-
Re: using s3 as a data source
Dave Viner 2010-06-14, 15:46

I have redacted the exact path, since I don't want to publish it on a newsgroup. But here's how I actually made the s3n URI:

Go to S3Fox, look up the file I want to test, and extract the full HTTP URL. This is something like:

    http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2

In this example, 'log' is the name of my bucket. Then I replace the http:// with s3n:// and remove the '.s3.amazonaws.com' from the string. That results in:

    s3n://log/file/path/2010.04.13.20:05:04.log.bz2

Then I add in the key and secret key:

    s3n://my-key:my-skey@/log/file/path/2010.04.13.20:05:04.log.bz2

Let me know if there's some other way to form the s3n URI.

Dave Viner
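
The URL-to-URI steps described above can be sketched in shell as follows. This is only an illustration: `http_to_s3n` and `with_credentials` are hypothetical helper names, not part of Hadoop or Pig. The sketch assumes the s3n://ID:SECRET@BUCKET/path layout, in which the bucket follows the '@' directly. In generic URI syntax, a '/' placed immediately after the '@' leaves the host part empty, which may be what the bucket-name question in this thread is pointing at; the thread does not confirm whether that caused ERROR 2118.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of Hadoop or Pig) that mirror the manual steps.

# Turn a virtual-hosted S3 HTTP URL into an s3n URI, keeping the bucket
# as the URI authority.
http_to_s3n() {
  local url="$1"
  local rest="${url#http://}"               # log.s3.amazonaws.com/file/path/...
  local host="${rest%%/*}"                  # log.s3.amazonaws.com
  local key="${rest#*/}"                    # file/path/...
  local bucket="${host%.s3.amazonaws.com}"  # log
  printf 's3n://%s/%s\n' "$bucket" "$key"
}

# Insert credentials before the bucket, with no '/' between '@' and the bucket.
with_credentials() {
  local uri="$1" id="$2" secret="$3"
  printf 's3n://%s:%s@%s\n' "$id" "$secret" "${uri#s3n://}"
}

uri="$(http_to_s3n 'http://log.s3.amazonaws.com/file/path/2010.04.13.20:05:04.log.bz2')"
echo "$uri"
with_credentials "$uri" 'my-key' 'my-skey'
```

The first line printed matches the intermediate URI from the steps above; the second differs from the final URI in the thread only in that there is no '/' between '@' and the bucket name 'log'.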