Re: use S3 as input to MR job
Are you sure that you prepared your MR code to work with multiple files?
This example (WordCount) works with a single input.

You should take a look at the MultipleInputs API for this.
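A minimal, self-contained sketch of that approach (just an illustration, not your code: the bucket paths and class names below are made up) that feeds two S3 paths to one word-count job through MultipleInputs:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MultiInputWordCount {

        // Plain word-count mapper; bound to each input path below.
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE);
                }
            }
        }

        // Sums the counts emitted for each word.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                ctx.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount-multiple-inputs");
            job.setJarByClass(MultiInputWordCount.class);

            // MultipleInputs binds an InputFormat and a Mapper to each path;
            // a path that is a directory brings in the files directly under it.
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/wordcount/input"),
                    TextInputFormat.class, TokenMapper.class);
            MultipleInputs.addInputPath(job, new Path("s3n://bucket/wordcount/more-input"),
                    TextInputFormat.class, TokenMapper.class);

            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileOutputFormat.setOutputPath(job, new Path("s3n://bucket/wordcount/output"));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

If all the files simply live under one directory, a plain FileInputFormat.addInputPath(job, new Path("s3n://bucket/wordcount/input")) with a single mapper class should also pick them up; MultipleInputs is mainly useful when different paths need different formats or mappers.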
Best wishes

On 02/10/2012 6:05, Ben Kim wrote:
> I'm having a similar issue
>
> I'm running a wordcount MR as follows
>
>     hadoop jar WordCount.jar wordcount.WordCountDriver
>     s3n://bucket/wordcount/input s3n://bucket/wordcount/output
>
> s3n://bucket/wordcount/input is an S3 object that contains the other input
> files.
>
> However, I get the following NPE error:
>
>     12/10/02 18:56:23 INFO mapred.JobClient:  map 0% reduce 0%
>     12/10/02 18:56:54 INFO mapred.JobClient:  map 50% reduce 0%
>             12/10/02 18:56:56 INFO mapred.JobClient: Task Id :
>     attempt_201210021853_0001_m_000001_0, Status : FAILED
>     java.lang.NullPointerException
>             at
>     org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
>             at
>     java.io.BufferedInputStream.close(BufferedInputStream.java:451)
>             at java.io.FilterInputStream.close(FilterInputStream.java:155)
>             at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
>             at
>     org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:144)
>             at
>     org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:497)
>             at
>     org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>             at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>             at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>             at java.security.AccessController.doPrivileged(Native Method)
>             at javax.security.auth.Subject.doAs(Subject.java:396)
>             at
>     org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>             at org.apache.hadoop.mapred.Child.main(Child.java:249)
>
>
> MR runs fine if I specify a more specific input path such as
> s3n://bucket/wordcount/input/file.txt.
> What I want is to be able to pass S3 folders as parameters.
> Does anyone know how to do this?
>
> Best regards,
> Ben Kim
>
>
> On Fri, Jul 20, 2012 at 10:33 AM, Harsh J <[EMAIL PROTECTED]> wrote:
>
>     Dan,
>
>     Can you share your error? Plain .gz files (not .tar.gz) are
>     natively supported by Hadoop via its GzipCodec, so if you are
>     facing an error, I believe it is caused by something other than
>     compression.
>
>
>     On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi <[EMAIL PROTECTED]> wrote:
>
>         I have an MR job that reads files on Amazon S3 and processes the
>         data on local HDFS. The files are gzipped text files (.gz). I tried
>         to set up the job as below, but it won't work. Does anyone know what
>         might be wrong? Do I need to add an extra step to unzip the files
>         first? Thanks.
>
>         String S3_LOCATION = "s3n://access_key:private_key@bucket_name";
>
>         protected void prepareHadoopJob() throws Exception {
>
>              this.getHadoopJob().setMapperClass(Mapper1.class);
>              this.getHadoopJob().setInputFormatClass(TextInputFormat.class);
>
>              FileInputFormat.addInputPath(this.getHadoopJob(), new Path(S3_LOCATION));
>
>              this.getHadoopJob().setNumReduceTasks(0);
>              this.getHadoopJob().setOutputFormatClass(TableOutputFormat.class);
>              this.getHadoopJob().getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
>              this.getHadoopJob().setOutputKeyClass(ImmutableBytesWritable.class);
>              this.getHadoopJob().setOutputValueClass(Put.class);
>         }
>
>
>
>
>         Dan Yi | Software Engineer, Analytics Engineering
>         Medio Systems Inc | 701 Pike St. #1500 Seattle, WA 98101
>         Predictive Analytics for a Connected World
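
On Harsh's point above about plain .gz files: Hadoop picks the decompression codec from the file-name suffix, which is why no separate unzip step is needed before the map phase. A quick, self-contained sketch of that lookup (the path below is made up):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;

    public class CodecCheck {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            CompressionCodecFactory factory = new CompressionCodecFactory(conf);

            // Codecs are resolved purely by file-name suffix; this is the same
            // lookup LineRecordReader does before it starts reading a split.
            CompressionCodec codec =
                    factory.getCodec(new Path("s3n://bucket_name/input/part-0001.gz"));

            // Prints org.apache.hadoop.io.compress.GzipCodec for a .gz path,
            // and "no codec" for an uncompressed file.
            System.out.println(codec == null ? "no codec" : codec.getClass().getName());
        }
    }

That is also why Harsh distinguishes .gz from .tar.gz: the codec will happily decompress a .tar.gz as well, but what comes out is a tar archive rather than line-oriented text, so the records would be garbage.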

Marcos Ortiz Valmaseda,
Data Engineer && Senior System Administrator at UCI
Blog: http://marcosluis2186.posterous.com
Linkedin: http://www.linkedin.com/in/marcosluis2186
Twitter: @marcosluis2186
