Re: use S3 as input to MR job
Marcos Ortiz 2012-10-02, 13:07
Are you sure that your MR code is prepared to work with multiple files?
This example (WordCount) works with a single input. You should take a look at the MultipleInputs API for this.

Best wishes

On 02/10/2012 6:05, Ben Kim wrote:
> I'm having a similar issue.
>
> I'm running a wordcount MR as follows:
>
>     hadoop jar WordCount.jar wordcount.WordCountDriver s3n://bucket/wordcount/input s3n://bucket/wordcount/output
>
> s3n://bucket/wordcount/input is an S3 object that contains other input
> files.
>
> However, I get the following NPE:
>
> 12/10/02 18:56:23 INFO mapred.JobClient:  map 0% reduce 0%
> 12/10/02 18:56:54 INFO mapred.JobClient:  map 50% reduce 0%
> 12/10/02 18:56:56 INFO mapred.JobClient: Task Id : attempt_201210021853_0001_m_000001_0, Status : FAILED
> java.lang.NullPointerException
>     at org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
>     at java.io.BufferedInputStream.close(BufferedInputStream.java:451)
>     at java.io.FilterInputStream.close(FilterInputStream.java:155)
>     at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
>     at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:144)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:497)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>     at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:396)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>     at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> MR runs fine if I specify a more specific input path such as
> s3n://bucket/wordcount/input/file.txt
> What I want is to be able to pass S3 folders as parameters.
> Does anyone know how to do this?
>
> Best regards,
> Ben Kim
>
>
> On Fri, Jul 20, 2012 at 10:33 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>     Dan,
>
>     Can you share your error? Plain .gz files (not .tar.gz) are
>     natively supported by Hadoop via its GzipCodec, and if you are
>     facing an error, I believe it is because of something other than
>     compression.
>
>
>     On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi <[EMAIL PROTECTED]> wrote:
>
>         I have an MR job that reads files on Amazon S3 and processes the data
>         on local HDFS. The files are gzipped text files (.gz). I tried
>         to set up the job as below, but it won't work. Does anyone know what
>         might be wrong? Do I need to add an extra step to unzip the files
>         first? Thanks.
>
>         String S3_LOCATION = "s3n://access_key:private_key@bucket_name";
>
>         protected void prepareHadoopJob() throws Exception {
>
>             this.getHadoopJob().setMapperClass(Mapper1.class);
>             this.getHadoopJob().setInputFormatClass(TextInputFormat.class);
>
>             FileInputFormat.addInputPath(this.getHadoopJob(), new Path(S3_LOCATION));
>
>             this.getHadoopJob().setNumReduceTasks(0);
>             this.getHadoopJob().setOutputFormatClass(TableOutputFormat.class);
>             this.getHadoopJob().getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
>             this.getHadoopJob().setOutputKeyClass(ImmutableBytesWritable.class);
>             this.getHadoopJob().setOutputValueClass(Put.class);
>         }
>
>
>         Dan Yi | Software Engineer, Analytics Engineering
>         Medio Systems Inc | 701 Pike St. #1500 Seattle, WA 98101
>         Predictive Analytics for a Connected World

Marcos Ortiz Valmaseda,
Data Engineer && Senior System Administrator at UCI
Blog: http://marcosluis2186.posterous.com
Linkedin: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
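
P.S. Since Ben reports that the job runs when the input path names a single file, one possible workaround is to hand the job an explicit list of files instead of the folder. The sketch below is only an illustration under stated assumptions: `InputPathList` and `join` are hypothetical names (not part of Hadoop), `file2.txt` is a made-up second file, and the list of file names is assumed to be known in advance. It builds the comma-separated string that `FileInputFormat.addInputPaths(job, String)` accepts.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of Hadoop. */
class InputPathList {

    /** Joins a base URI and file names into one comma-separated path list. */
    static String join(String base, List<String> fileNames) {
        String prefix = base.endsWith("/") ? base : base + "/";
        List<String> paths = new ArrayList<>();
        for (String name : fileNames) {
            if (name == null || name.trim().isEmpty()) {
                continue; // skip blank entries so the bare folder is never emitted
            }
            paths.add(prefix + name.trim());
        }
        return String.join(",", paths);
    }

    public static void main(String[] args) {
        // "file2.txt" is an assumed second input file, for illustration only.
        String inputs = join("s3n://bucket/wordcount/input",
                Arrays.asList("file.txt", "file2.txt"));
        System.out.println(inputs);
        // In the driver, this string would then be passed as:
        // FileInputFormat.addInputPaths(job, inputs);
    }
}
```

If the inputs need different mappers or input formats per path, that is the case the MultipleInputs API mentioned above is meant for.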
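
P.P.S. On Harsh's point in the quoted thread that plain .gz files (not .tar.gz) are supported: before blaming compression, it may help to check locally that an input really is gzipped text. The following is a minimal sketch using only the JDK, not Hadoop's GzipCodec; `GzipCheck` and its methods are hypothetical names, and the sample data is generated in memory rather than read from S3.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical local sanity check for gzipped text input. */
class GzipCheck {

    /** True if the data starts with the gzip magic bytes 0x1f 0x8b. */
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    /** Decompresses gzip data and returns its first line, or null if it is empty. */
    static String firstLine(byte[] gzipData) {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gzipData)),
                StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Compresses text in memory so the example needs no file on disk. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        byte[] sample = gzip("hello world\nsecond line\n");
        System.out.println(looksLikeGzip(sample)); // is a gzip header present?
        System.out.println(firstLine(sample));     // readable text for a plain .gz
    }
}
```

A .tar.gz archive would also pass the magic-byte check, but its first decompressed line would begin with a tar header rather than the expected text, which is one way to tell the two apart.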