-
mapreduce linear chaining: ClassCastException
Periya.Data 2011-10-15, 00:31
Hi all, I am trying a simple extension of the WordCount example in Hadoop. I want to get the word frequencies in descending order. To do that, I employ a linear chain of MR jobs. The first MR job (MR-1) does the regular word count (the usual example). For the next MR job, I set the mapper to swap <word, count> to <count, word>, and then use the Identity reducer to simply store the results.
My MR-1 does its job correctly and stores the result in a temp path.
Question 1: The mapper of the second MR job (MR-2) doesn't like the input format. I have set the input types that MapClass2 expects and the output types it must produce. It seems to be expecting a LongWritable. I suspect that it is trying to look at some index file, but I am not sure. It throws an error like this:
<code> java.lang.ClassCastException: org.apache.hadoop.io.LongWritable cannot be cast to org.apache.hadoop.io.Text </code>
Some info:
- I use the old API (org.apache.hadoop.mapred.*). I am asked to stick with it for now.
- I use hadoop-0.20.2
For MR-1:
- conf1.setOutputKeyClass(Text.class);
- conf1.setOutputValueClass(IntWritable.class);
For MR-2:
- takes in a Text (word) and IntWritable (sum)
- conf2.setOutputKeyClass(IntWritable.class);
- conf2.setOutputValueClass(Text.class);
<code>
public class MapClass2 extends MapReduceBase
    implements Mapper<Text, IntWritable, IntWritable, Text> {

    @Override
    public void map(Text word, IntWritable sum,
                    OutputCollector<IntWritable, Text> output,
                    Reporter reporter) throws IOException {
        output.collect(sum, word); // <sum, word>
    }
}
</code>
Any suggestions would be helpful. Is my MapClass2 code right in the first place for swapping? Or should I assume that the mapper reads line by line, so that I must read in one line, use StrTokenizer to split it up, and convert the second token (sum) from a string to an int? Or should I mess around with the OutputKeyComparator class?
Thanks, PD
-
Re: mapreduce linear chaining: ClassCastException
bejoy.hadoop@... 2011-10-15, 08:06
Hi,
I believe this is what is happening in your case. The first MapReduce job runs to completion. When you trigger the second MapReduce job, it runs with the default input format, TextInputFormat, which hands the mapper a LongWritable key (the byte offset of the line) and a Text value (the line itself). By default, a MapReduce job's output format is TextOutputFormat, which writes each key and value separated by a tab. When the output of one MR job needs to be consumed as key-value pairs by another MR job, use KeyValueInputFormat, i.e. while setting the config parameters for the second job, call jobConf.setInputFormat(KeyValueInputFormat.class). If your output key-value pairs use a separator other than the default tab, you also need to specify it for the second job using key.value.separator.in.input.line
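To make the tab-separated key/value idea concrete, here is a minimal plain-Java sketch of the splitting rule. It uses no Hadoop classes: `KeyValueSplitSketch` and `splitKeyValue` are hypothetical names made up for illustration. It assumes the line is cut at the first separator only, and that a line with no separator ends up entirely in the key with an empty value; please verify that second assumption against your Hadoop version.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class KeyValueSplitSketch {
    // Hypothetical helper: cuts a line at the FIRST occurrence of the separator.
    static Map.Entry<String, String> splitKeyValue(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            // Assumption: no separator means the whole line is the key, value is empty.
            return new SimpleEntry<>(line, "");
        }
        // Everything before the separator is the key, everything after it is the value.
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // One line of MR-1 output as written with a tab between word and count.
        Map.Entry<String, String> kv = splitKeyValue("hadoop\t42", '\t');
        System.out.println(kv.getKey() + " -> " + kv.getValue());
    }
}
```

Note that both parts come out as text, which is why the second mapper sees the count as a string and not as a number.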
In short, for your case, doing the following in the second MapReduce job should get things in place:
- use jobConf.setInputFormat(KeyValueInputFormat.class)
- alter your mapper to accept keys and values of type Text, Text
- swap the key and value for output
Note that, AFAIK, KeyValueInputFormat is not part of the new mapreduce API. Hope it helps.
Regards Bejoy K S
-
Re: mapreduce linear chaining: ClassCastException
bejoy.hadoop@... 2011-10-15, 08:08
Hi,
I believe this is what is happening in your case. The first MapReduce job runs to completion. When you trigger the second MapReduce job, it runs with the default input format, TextInputFormat, which hands the mapper a LongWritable key (the byte offset of the line) and a Text value (the line itself). By default, a MapReduce job's output format is TextOutputFormat, which writes each key and value separated by a tab. When the output of one MR job needs to be consumed as key-value pairs by another MR job, use KeyValueInputFormat, i.e. while setting the config parameters for the second job, call jobConf.setInputFormat(KeyValueInputFormat.class). If your output key-value pairs use a separator other than the default tab, you also need to specify it for the second job using key.value.separator.in.input.line
In short, for your case, doing the following in the second MapReduce job should get things in place:
- use jobConf.setInputFormat(KeyValueInputFormat.class)
- alter your mapper to accept keys and values of type Text, Text
- swap the key and value within the mapper for output to the reducer, with conversions
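The swap with conversions can be sketched in plain Java as follows. This is only an illustration under assumptions: `SwapSketch` and `swap` are hypothetical names, and String and Integer stand in for the Text and IntWritable types the real mapper would use. The count arrives as text, so it has to be parsed before it can become the numeric key.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

class SwapSketch {
    // Hypothetical helper: takes (word, count-as-text) and returns (count, word).
    static Map.Entry<Integer, String> swap(String word, String countText) {
        // Parse the textual count; a malformed value throws NumberFormatException.
        int count = Integer.parseInt(countText.trim());
        return new SimpleEntry<>(count, word);
    }

    public static void main(String[] args) {
        Map.Entry<Integer, String> swapped = swap("hadoop", "42");
        System.out.println(swapped.getKey() + "\t" + swapped.getValue());
    }
}
```

In a real job you would probably want to decide what to do with lines whose count does not parse (skip them, count them, or fail the task); the sketch simply lets the exception propagate.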
Note that, AFAIK, KeyValueInputFormat is not part of the new mapreduce API.
Hope it helps.
Regards Bejoy K S
-
Re: mapreduce linear chaining: ClassCastException
Periya.Data 2011-10-15, 17:59
Fantastic! Thanks much, Bejoy. Now I am able to get the output of my MR-2 nicely. I had to convert the sum (in Text format) to IntWritable, and I am able to get all the word frequencies <Freq, Word> in ascending order. I used "KeyValueTextInputFormat.class". My program was complaining when I used "KeyValueInputFormat".
Now, let me investigate how to do that in descending order...and then top-20...etc. I know I must look into RawComparator and more...
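For the top-20 part, one common approach is a bounded min-heap that never holds more than N entries. The sketch below is plain Java and not Hadoop code: `TopNSketch`, `pair`, and `topN` are hypothetical names, and where this logic would live in a job (a single reducer, for instance) is left open. The order among entries with equal counts is not defined here.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class TopNSketch {
    // Hypothetical convenience constructor for a (count, word) pair.
    static Map.Entry<Integer, String> pair(int count, String word) {
        return new SimpleEntry<>(count, word);
    }

    // Returns the n pairs with the largest counts, highest count first.
    static List<Map.Entry<Integer, String>> topN(List<Map.Entry<Integer, String>> pairs, int n) {
        Comparator<Map.Entry<Integer, String>> byCount =
                (a, b) -> Integer.compare(a.getKey(), b.getKey());
        // Min-heap on count: the smallest of the kept entries sits at the head.
        PriorityQueue<Map.Entry<Integer, String>> heap = new PriorityQueue<>(byCount);
        for (Map.Entry<Integer, String> p : pairs) {
            heap.offer(p);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest so at most n remain
            }
        }
        List<Map.Entry<Integer, String>> result = new ArrayList<>(heap);
        result.sort(byCount.reversed()); // descending by count
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> counts =
                Arrays.asList(pair(3, "c"), pair(10, "a"), pair(7, "b"));
        for (Map.Entry<Integer, String> e : topN(counts, 2)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```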
Thanks, PD.
-
Re: mapreduce linear chaining: ClassCastException
bejoy.hadoop@... 2011-10-15, 19:08
Great!..
Sorry about KeyValueInputFormat; it is indeed KeyValueTextInputFormat. I was replying from my handheld and was recalling the class name from memory, so excuse me for that. :)
For your further requirements, like descending order, I believe you will need to play around with a Comparator.
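As a rough idea of what such a comparator does, here is a plain-Java sketch that compares two serialized int keys in reverse. It is not a Hadoop class: `DescendingIntBytesSketch` and `encode` are hypothetical names, and the method only mirrors the shape of RawComparator's compare(byte[], int, int, byte[], int, int). It assumes each key is a 4-byte big-endian int, which is how I understand IntWritable to be serialized. In the old API I believe the real comparator would be registered through JobConf.setOutputKeyComparatorClass.

```java
import java.nio.ByteBuffer;

class DescendingIntBytesSketch {
    // Compares two serialized int keys so that LARGER values sort first.
    // b1/b2 are buffers, s1/s2 start offsets, l1/l2 lengths (expected to be 4).
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        // ByteBuffer reads big-endian by default, matching the assumed encoding.
        int left = ByteBuffer.wrap(b1, s1, l1).getInt();
        int right = ByteBuffer.wrap(b2, s2, l2).getInt();
        // Arguments are swapped on purpose: this reverses the natural order.
        return Integer.compare(right, left);
    }

    // Hypothetical helper: encodes an int as 4 big-endian bytes for the demo.
    static byte[] encode(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    public static void main(String[] args) {
        int result = compare(encode(5), 0, 4, encode(3), 0, 4);
        System.out.println(result < 0 ? "5 sorts before 3" : "3 sorts before 5");
    }
}
```

Comparing the raw bytes avoids deserializing every key during the sort, which is the usual reason for going through RawComparator instead of a plain object comparator.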
Thank you
Regards Bejoy K S