-
why does my mapper class read my input file twice?
Jane Wayne 2012-03-06, 03:33
i have code that reads in a text file. i notice that each line in the text file is somehow being read twice. why is this happening?
my mapper class looks like the following:
    public class MyMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

        private static final Log _log = LogFactory.getLog(MyMapper.class);

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String s = (new StringBuilder()).append(value.toString()).append("m").toString();
            context.write(key, new Text(s));
            _log.debug(key.toString() + " => " + s);
        }
    }
my reducer class looks like the following:
    public class MyReducer extends Reducer<LongWritable, Text, LongWritable, Text> {

        private static final Log _log = LogFactory.getLog(MyReducer.class);

        @Override
        public void reduce(LongWritable key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            for (Iterator<Text> it = values.iterator(); it.hasNext();) {
                Text txt = it.next();
                String s = (new StringBuilder()).append(txt.toString()).append("r").toString();
                context.write(key, new Text(s));
                _log.debug(key.toString() + " => " + s);
            }
        }
    }
my job class looks like the following:
    public class MyJob extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            ToolRunner.run(new Configuration(), new MyJob(), args);
        }

        @Override
        public int run(String[] args) throws Exception {
            Configuration conf = getConf();
            Path input = new Path(conf.get("mapred.input.dir"));
            Path output = new Path(conf.get("mapred.output.dir"));

            Job job = new Job(conf, "dummy job");
            job.setMapOutputKeyClass(LongWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            job.setMapperClass(MyMapper.class);
            job.setReducerClass(MyReducer.class);

            FileInputFormat.addInputPath(job, input);
            FileOutputFormat.setOutputPath(job, output);

            job.setJarByClass(MyJob.class);

            return job.waitForCompletion(true) ? 0 : 1;
        }
    }
the text file that i am trying to read in looks like the following. as you can see, there are 9 lines.
    T, T
    T, T
    T, T
    F, F
    F, F
    F, F
    F, F
    T, F
    F, T
the output file that i get after my Job runs looks like the following. as you can see, there are 18 lines. each key is emitted twice from the mapper to the reducer.
    0 T, Tmr
    0 T, Tmr
    6 T, Tmr
    6 T, Tmr
    12 T, Tmr
    12 T, Tmr
    18 F, Fmr
    18 F, Fmr
    24 F, Fmr
    24 F, Fmr
    30 F, Fmr
    30 F, Fmr
    36 F, Fmr
    36 F, Fmr
    42 T, Fmr
    42 T, Fmr
    48 F, Tmr
    48 F, Tmr
the way i execute my Job is as follows (cygwin + hadoop 0.20.2).
hadoop jar dummy-0.1.jar dummy.MyJob -Dmapred.input.dir=data/dummy.txt -Dmapred.output.dir=result
originally, this happened when i read in a sequence file, but even for a text file, this problem is still happening. is it the way i have setup my Job?
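A note on the keys in the output above: with the default text input format, the LongWritable key passed to the mapper is the byte offset at which the line starts in the file. The sketch below recomputes those offsets for the 9-line sample in plain Java. It assumes every line ends with a two-byte CRLF terminator, which is a guess based on the cygwin setup and the step of 6 between keys; the class and method names (LineOffsets, offsets) are made up for illustration and are not part of Hadoop.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    // Returns the byte offset at which each line starts, assuming every line
    // (including the last one) is followed by the given terminator.
    static List<Long> offsets(String[] lines, String terminator) {
        List<Long> result = new ArrayList<>();
        long pos = 0;
        for (String line : lines) {
            result.add(pos);
            pos += (line + terminator).getBytes(StandardCharsets.UTF_8).length;
        }
        return result;
    }

    public static void main(String[] args) {
        String[] lines = {
            "T, T", "T, T", "T, T", "F, F", "F, F", "F, F", "F, F", "T, F", "F, T"
        };
        // Assumption: CRLF line endings (4 bytes of text + 2 bytes of terminator).
        System.out.println(offsets(lines, "\r\n"));
        // prints [0, 6, 12, 18, 24, 30, 36, 42, 48]
    }
}
```

Under that assumption the 9 offsets match the 9 distinct keys in the output, so each key showing up twice is consistent with the same file contents being fed to the mappers twice.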
-
Re: why does my mapper class read my input file twice?
Harsh J 2012-03-06, 07:06
It's your use of the mapred.input.dir property, which is a reserved name in the framework (it's what FileInputFormat uses).
You have a config you extract the path from:
    Path input = new Path(conf.get("mapred.input.dir"));
Then you do:
    FileInputFormat.addInputPath(job, input);
Internally, that simply appends a path to a config prop called "mapred.input.dir". Hence your job gets launched with two input files (the very same one, twice): one already present in the Tool-provided configuration (because of your -Dmapred.input.dir) and the other added by you.
Fix the input path line to use a different config:
    Path input = new Path(conf.get("input.path"));
And run the job as:
    hadoop jar dummy-0.1.jar dummy.MyJob -Dinput.path=data/dummy.txt -Dmapred.output.dir=result
-- Harsh J
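The explanation above can be sketched without a Hadoop install. The snippet below is a simplified stand-in, not the real API: a plain Map plays the role of the job Configuration, and the local addInputPath helper only mimics the append-to-a-comma-separated-list behaviour described for FileInputFormat.addInputPath (the real method also qualifies the path first, which is skipped here). The class name InputDirDemo and the methods buggy and fixed are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class InputDirDemo {
    static final String INPUT_DIR = "mapred.input.dir";

    // Simplified stand-in for FileInputFormat.addInputPath: append the path to
    // the comma-separated list already stored under mapred.input.dir.
    static void addInputPath(Map<String, String> conf, String path) {
        String dirs = conf.get(INPUT_DIR);
        conf.put(INPUT_DIR, dirs == null ? path : dirs + "," + path);
    }

    // Original driver: -Dmapred.input.dir=<path> has already set the property,
    // and the driver then adds the very same path a second time.
    static String buggy(String path) {
        Map<String, String> conf = new HashMap<>();
        conf.put(INPUT_DIR, path);                   // effect of -Dmapred.input.dir=<path>
        addInputPath(conf, conf.get(INPUT_DIR));     // driver re-adds it
        return conf.get(INPUT_DIR);
    }

    // Suggested fix: pass the path under a non-reserved key (input.path),
    // so the reserved property is written exactly once.
    static String fixed(String path) {
        Map<String, String> conf = new HashMap<>();
        conf.put("input.path", path);                // effect of -Dinput.path=<path>
        addInputPath(conf, conf.get("input.path"));
        return conf.get(INPUT_DIR);
    }

    public static void main(String[] args) {
        System.out.println(buggy("data/dummy.txt")); // data/dummy.txt,data/dummy.txt
        System.out.println(fixed("data/dummy.txt")); // data/dummy.txt
    }
}
```

In the first case the property ends up listing the same file twice, which is why every record reaches the mapper twice; in the second it lists the file once.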
-
Re: why does my mapper class read my input file twice?
Jane Wayne 2012-03-06, 13:52
Harsh,
Thanks. I looked at the code for FileInputFormat.addInputPath(Job, Path) and it is as you stated. That makes sense now. I simply commented out FileInputFormat.addInputPath(job, input) and FileOutputFormat.setOutputPath(job, output), so the paths come only from the -Dmapred.input.dir and -Dmapred.output.dir options, and everything automagically works now.
Thanks a bunch!