MapReduce >> mail # user >> use S3 as input to MR job


Re: use S3 as input to MR job
I'm having a similar issue.

I'm running a wordcount MR job as follows:

hadoop jar WordCount.jar wordcount.WordCountDriver
> s3n://bucket/wordcount/input s3n://bucket/wordcount/output

s3n://bucket/wordcount/input is an S3 folder (prefix) that contains the input files.

However, I get the following NPE:

> 12/10/02 18:56:23 INFO mapred.JobClient:  map 0% reduce 0%
> 12/10/02 18:56:54 INFO mapred.JobClient:  map 50% reduce 0%
>         12/10/02 18:56:56 INFO mapred.JobClient: Task Id :
> attempt_201210021853_0001_m_000001_0, Status : FAILED
> java.lang.NullPointerException
>         at
> org.apache.hadoop.fs.s3native.NativeS3FileSystem$NativeS3FsInputStream.close(NativeS3FileSystem.java:106)
>         at java.io.BufferedInputStream.close(BufferedInputStream.java:451)
>         at java.io.FilterInputStream.close(FilterInputStream.java:155)
>         at org.apache.hadoop.util.LineReader.close(LineReader.java:83)
>         at
> org.apache.hadoop.mapreduce.lib.input.LineRecordReader.close(LineRecordReader.java:144)
>         at
> org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.close(MapTask.java:497)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
The MR job runs fine if I specify a more specific input path, such as
s3n://bucket/wordcount/input/file.txt.
What I want is to be able to pass S3 folders as parameters.
Does anyone know how to do this?

Best regards,
Ben Kim
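A likely cause of this NPE (not confirmed in the thread, so treat it as an assumption) is that some S3 tools create zero-byte "directory marker" objects for each folder prefix, e.g. keys ending in "/" or "_$folder$", and these get picked up as input splits when a folder is passed as the input path. In Hadoop this is commonly handled by registering a PathFilter through FileInputFormat.setInputPathFilter. Below is a minimal, Hadoop-free sketch of just the filtering rule; the key names and sizes are hypothetical:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: decide which S3 keys under a prefix should become input files.
// Zero-byte "directory marker" keys (ending in "/" or "_$folder$") are the
// kind of entry a PathFilter passed to FileInputFormat.setInputPathFilter
// would exclude.
public class S3InputFilter {
    static boolean isDirectoryMarker(String key, long size) {
        return size == 0 && (key.endsWith("/") || key.endsWith("_$folder$"));
    }

    static List<String> selectInputs(Map<String, Long> listing) {
        List<String> inputs = new ArrayList<>();
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            if (!isDirectoryMarker(e.getKey(), e.getValue())) {
                inputs.add(e.getKey());
            }
        }
        return inputs;
    }

    public static void main(String[] args) {
        Map<String, Long> listing = new LinkedHashMap<>();
        listing.put("wordcount/input/", 0L);               // marker object
        listing.put("wordcount/input_$folder$", 0L);       // marker (older tools)
        listing.put("wordcount/input/part-0001.txt", 123L);
        listing.put("wordcount/input/part-0002.txt", 456L);
        System.out.println(selectInputs(listing));
        // prints [wordcount/input/part-0001.txt, wordcount/input/part-0002.txt]
    }
}
```

If markers are indeed the culprit, an alternative to a filter is to glob only real files, e.g. passing s3n://bucket/wordcount/input/* as the input path.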
On Fri, Jul 20, 2012 at 10:33 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> Dan,
>
> Can you share your error? Plain .gz files (not .tar.gz) are natively
> supported by Hadoop via its GzipCodec, and if you are facing an error, I
> believe it is caused by something other than compression.
>
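The transparent .gz handling described above can be illustrated outside Hadoop with java.util.zip, which provides the same stream decoration that TextInputFormat applies via GzipCodec for inputs ending in .gz. A self-contained sketch (the sample text is made up):

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Round-trip text through gzip and read it back line by line -- analogous
// to how a record reader consumes a .gz input file without any explicit
// unzip step.
public class GzipRoundTrip {
    // Compress a string to gzip bytes.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(bos), "UTF-8")) {
            w.write(text);
        }
        return bos.toByteArray();
    }

    // Decompress gzip bytes and split the result into lines.
    static List<String> gunzipLines(byte[] gz) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gz)), "UTF-8"))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] gz = gzip("hello world\nhello hadoop\n");
        System.out.println(gunzipLines(gz)); // prints [hello world, hello hadoop]
    }
}
```

One caveat worth noting: gzip is not a splittable format, so each .gz input file is processed by a single mapper rather than being divided into splits.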
>
> On Fri, Jul 20, 2012 at 6:14 AM, Dan Yi <[EMAIL PROTECTED]> wrote:
>
>> I have a MR job that reads files on Amazon S3 and processes the data on
>> local HDFS. The files are gzipped text files (.gz). I tried to set up the
>> job as below but it won't work. Does anyone know what might be wrong? Do I
>> need to add an extra step to unzip the files first? Thanks.
>>
>> String S3_LOCATION = "s3n://access_key:private_key@bucket_name";
>>
>> protected void prepareHadoopJob() throws Exception {
>>
>>     this.getHadoopJob().setMapperClass(Mapper1.class);
>>     this.getHadoopJob().setInputFormatClass(TextInputFormat.class);
>>
>>     FileInputFormat.addInputPath(this.getHadoopJob(), new Path(S3_LOCATION));
>>
>>     this.getHadoopJob().setNumReduceTasks(0);
>>     this.getHadoopJob().setOutputFormatClass(TableOutputFormat.class);
>>     this.getHadoopJob().getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, myTable.getTableName());
>>     this.getHadoopJob().setOutputKeyClass(ImmutableBytesWritable.class);
>>     this.getHadoopJob().setOutputValueClass(Put.class);
>> }
>>
>>
>>
>>
>> Dan Yi | Software Engineer, Analytics Engineering
>> Medio Systems Inc | 701 Pike St. #1500 Seattle, WA 98101
>> Predictive Analytics for a Connected World
>>
>>
>
>
> --
> Harsh J
>

--

Benjamin Kim
benkimkimben at gmail