Re: why does my mapper class read my input file twice?
Harsh,

Thanks. I went into the code of FileInputFormat.addInputPath(Job, Path) and
it is as you stated. That makes sense now. I simply commented out
FileInputFormat.addInputPath(job, input) and
FileOutputFormat.setOutputPath(job, output), and everything automagically
works now (a sketch of the resulting run() method is below).

Thanks a bunch!
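
For reference, here is a minimal sketch of what the cleaned-up run() method amounts to after dropping those two calls. It assumes the paths are supplied entirely on the command line via -Dmapred.input.dir and -Dmapred.output.dir, which ToolRunner's GenericOptionsParser copies into the job's Configuration:

@Override
public int run(String[] args) throws Exception {
    // mapred.input.dir and mapred.output.dir are already set in the
    // Configuration by the -D options, so FileInputFormat and
    // FileOutputFormat pick them up without any explicit calls here.
    Job job = new Job(getConf(), "dummy job");

    job.setMapOutputKeyClass(LongWritable.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setJarByClass(MyJob.class);

    // No FileInputFormat.addInputPath(job, input) here: adding the path a
    // second time is what made every line show up twice.
    return job.waitForCompletion(true) ? 0 : 1;
}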

On Tue, Mar 6, 2012 at 2:06 AM, Harsh J <[EMAIL PROTECTED]> wrote:

> It's your use of the mapred.input.dir property, which is a reserved
> name in the framework (it's what FileInputFormat uses).
>
> You have a config property you extract the path from:
> Path input = new Path(conf.get("mapred.input.dir"));
>
> Then you do:
> FileInputFormat.addInputPath(job, input);
>
> Which, internally, simply appends the path to a config prop called
> "mapred.input.dir". Hence your job gets launched with two input files
> (the very same one): one added by the Tool-provided configuration
> (because of your -Dmapred.input.dir) and the other added by you.
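
In other words, addInputPath boils down to roughly the following sketch (simplified; the real FileInputFormat.addInputPath also qualifies the path against its FileSystem and escapes special characters):

public static void addInputPath(Job job, Path path) throws IOException {
    Configuration conf = job.getConfiguration();
    // Append to, rather than overwrite, any existing value. A path that
    // -Dmapred.input.dir already put in the configuration therefore ends
    // up in the list twice.
    String dirs = conf.get("mapred.input.dir");
    conf.set("mapred.input.dir",
            dirs == null ? path.toString() : dirs + "," + path.toString());
}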
>
> Fix the input path line to use a different config key:
> Path input = new Path(conf.get("input.path"));
>
> And run the job as:
> hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt
> -Dmapred.output.dir=result
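
Put together, the changed lines in the run() method of the question below would look roughly like this (using the non-reserved key "input.path" from the command above; any property name the framework does not already claim would work):

// Read the input path from a property the framework does not own, so the
// explicit addInputPath() call is the only writer of mapred.input.dir.
Path input = new Path(conf.get("input.path"));
Path output = new Path(conf.get("mapred.output.dir"));
// ... rest of the job setup unchanged ...
FileInputFormat.addInputPath(job, input);    // now adds the path exactly once
FileOutputFormat.setOutputPath(job, output); // setOutputPath overwrites, so it never duplicated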
>
> On Tue, Mar 6, 2012 at 9:03 AM, Jane Wayne <[EMAIL PROTECTED]> wrote:
> > I have code that reads in a text file. I notice that each line in the
> > text file is somehow being read twice. Why is this happening?
> >
> > My mapper class looks like the following:
> >
> > public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyMapper.class);
> >
> >     @Override
> >     public void map(LongWritable key, Text value, Context context)
> >             throws IOException, InterruptedException {
> >         String s = (new StringBuilder()).append(value.toString()).append("m").toString();
> >         context.write(key, new Text(s));
> >         _log.debug(key.toString() + " => " + s);
> >     }
> > }
> >
> > My reducer class looks like the following:
> >
> > public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
> >
> >     private static final Log _log = LogFactory.getLog(MyReducer.class);
> >
> >     @Override
> >     public void reduce(LongWritable key, Iterable<Text> values, Context context)
> >             throws IOException, InterruptedException {
> >         for (Iterator<Text> it = values.iterator(); it.hasNext();) {
> >             Text txt = it.next();
> >             String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
> >             context.write(key, new Text(s));
> >             _log.debug(key.toString() + " => " + s);
> >         }
> >     }
> > }
> >
> > My job class looks like the following:
> >
> > public class MyJob extends Configured implements Tool {
> >
> >     public static void main(String[] args) throws Exception {
> >         ToolRunner.run(new Configuration(), new MyJob(), args);
> >     }
> >
> >     @Override
> >     public int run(String[] args) throws Exception {
> >         Configuration conf = getConf();
> >         Path input = new Path(conf.get("mapred.input.dir"));
> >         Path output = new Path(conf.get("mapred.output.dir"));
> >
> >         Job job = new Job(conf, "dummy job");
> >         job.setMapOutputKeyClass(LongWritable.class);
> >         job.setMapOutputValueClass(Text.class);
> >         job.setOutputKeyClass(LongWritable.class);
> >         job.setOutputValueClass(Text.class);
> >
> >         job.setMapperClass(MyMapper.class);
> >         job.setReducerClass(MyReducer.class);
> >
> >         FileInputFormat.addInputPath(job, input);
> >         FileOutputFormat.setOutputPath(job, output);
> >
> >         job.setJarByClass(MyJob.class);
> >
> >         return job.waitForCompletion(true) ? 0 : 1;
> >     }
> > }
> >
> > The text file that I am trying to read in looks like the following. As
> > you can see, there are 9 lines.
> >
> > T, T
> > T, T
> > T, T
> > F, F
> > F, F
> > F, F
> > F, F
> > T, F
> > F, T
> >
> > The output file that I get after my job runs looks like the following.
> > As you can see, there are 18 lines. Each key is emitted twice from the
> > mapper to the reducer.
> >
> > 0   T, Tmr
> > 0   T, Tmr
> > 6   T, Tmr
> > 6   T, Tmr