Why does my mapper class read my input file twice?
Jane Wayne 2012-03-06, 03:33
I have code that reads in a text file, and I notice that each line of the
file is somehow being read twice. Why is this happening?

My mapper class looks like the following:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyMapper.class);

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // tag each input line with "m" so I can see it passed through the mapper
        String s = value.toString() + "m";
        context.write(key, new Text(s));
        _log.debug(key.toString() + " => " + s);
    }
}

My reducer class looks like the following:

import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

    private static final Log _log = LogFactory.getLog(MyReducer.class);

    @Override
    public void reduce(LongWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // tag each value with "r" so I can see it passed through the reducer
        for (Text txt : values) {
            String s = txt.toString() + "r";
            context.write(key, new Text(s));
            _log.debug(key.toString() + " => " + s);
        }
    }
}

My job class looks like the following:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new MyJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // the -D options from the command line should be set on conf by now
        Path input = new Path(conf.get("mapred.input.dir"));
        Path output = new Path(conf.get("mapred.output.dir"));

        Job job = new Job(conf, "dummy job");
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, output);

        job.setJarByClass(MyJob.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }
}

The text file that I am trying to read in looks like the following. As you
can see, there are 9 lines.

T, T
T, T
T, T
F, F
F, F
F, F
F, F
T, F
F, T
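
(An aside on the keys in the output below: they are the byte offsets that
TextInputFormat assigns to each line. A quick throwaway check of the 6-byte
spacing, assuming CRLF line endings since I am on cygwin; OffsetCheck is
just a name I made up for this sketch:)

public class OffsetCheck {
    public static void main(String[] args) {
        // each line is "T, T" (4 bytes) plus a CRLF line ending (2 bytes)
        int bytesPerLine = "T, T".length() + "\r\n".length(); // 6
        for (int i = 0; i < 9; i++) {
            System.out.print((i * bytesPerLine) + " "); // 0 6 12 ... 48
        }
    }
}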

The output file that I get after my Job runs looks like the following. As
you can see, there are 18 lines; each key/value pair is emitted twice from
the mapper to the reducer.

0   T, Tmr
0   T, Tmr
6   T, Tmr
6   T, Tmr
12  T, Tmr
12  T, Tmr
18  F, Fmr
18  F, Fmr
24  F, Fmr
24  F, Fmr
30  F, Fmr
30  F, Fmr
36  F, Fmr
36  F, Fmr
42  T, Fmr
42  T, Fmr
48  F, Tmr
48  F, Tmr
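
For comparison, what I expected to see is one line per input record (the
mapper's "m" suffix plus the reducer's "r" suffix), i.e. these 9 lines:

0   T, Tmr
6   T, Tmr
12  T, Tmr
18  F, Fmr
24  F, Fmr
30  F, Fmr
36  F, Fmr
42  T, Fmr
48  F, Tmr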

The way I execute my Job is as follows (cygwin + Hadoop 0.20.2).

hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt \
  -Dmapred.output.dir=result
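
As I understand it, ToolRunner hands those -D options to
GenericOptionsParser, which sets them on the Configuration before run() is
called. A minimal sketch of that behavior as I understand it (ParseCheck is
just a throwaway name, not the actual ToolRunner source):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.GenericOptionsParser;

public class ParseCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] cmd = { "-Dmapred.input.dir=data/dummy.txt",
                         "-Dmapred.output.dir=result" };
        GenericOptionsParser parser = new GenericOptionsParser(conf, cmd);
        // the -D pairs are now visible on the Configuration
        System.out.println(conf.get("mapred.input.dir"));  // data/dummy.txt
        System.out.println(conf.get("mapred.output.dir")); // result
        // anything left over is what the Tool's run(String[]) receives
        System.out.println(parser.getRemainingArgs().length); // 0
    }
}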

Originally this happened when I was reading in a sequence file, but the
problem still happens even with a plain text file. Is it the way I have set
up my Job?