Re: conf.setCombinerClass in Map/Reduce
Ted Yu 2010-10-06, 22:37
If the input to the reducer is <K2,V2>, the combiner takes in <K2,V2> and
emits <K2,V2>.

On Tue, Oct 5, 2010 at 10:03 PM, Shi Yu <[EMAIL PROTECTED]> wrote:

> Hi, thanks for the answer, Antonio.
>
> I have found one of the main problems. I used MultipleOutputs in the
> Reduce class, so when I set the Combiner and the Reducer, the Combiner
> does not pass a normal data flow on to the Reducer. The program therefore
> stops at the Combiner and no Reducer actually does any work.
> To solve this, I have to write to both outputs:
>
> OutputCollector collector =
>     multipleOutputs.getCollector("stringlabel", keyText, reporter);
> collector.collect(keyText, value);
> output.collect(key, value);
>
> The collector generates the separate output files, and the output makes
> sure the data flows on to the Reducer. After this change, both the
> Combiner and the Reducer work.
>
> The remaining question: if I want to use both the Combiner and the
> Reducer, must the input and output of the Reduce class be the same
> <K2,V2>? If not, how do I do it? The use case seems very limited here, for
> example when the Reducer class is a little more complicated, with <K2,V2>
> as input and <K3,V3> as output.
>
> Thanks again.
>
> Shi
>
> On 2010-10-5 23:48, Antonio Piccolboni wrote:
>
>> On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I am still confused about the effect of using a Combiner in Hadoop
>>> Map/Reduce. The performance tips
>>> (http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/)
>>> suggest writing a combiner to do initial aggregation before the data
>>> hits the reducer, for a performance advantage. But in most of the
>>> example code and books I have seen, the same reduce class is set as both
>>> the reducer and the combiner, such as
>>>
>>> conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>> I don't know the specific reason for doing this.
>>> In my own code, based on Hadoop 0.19.2, if I set the combiner class to
>>> the reduce class that uses MultipleOutputs, the output files are named
>>> xxx-m-00000, and if there are multiple input paths, the number of output
>>> files equals the number of input paths. conf.setNumReduceTasks(int) then
>>> has no effect on the number of output files. I wonder where the
>>> reducer-generated outputs are in this case, because I cannot see them.
>>> To see the reducer output, I have to remove the combiner class
>>>
>>> //conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>> and then I get output files named xxx-r-00000. I can then control
>>> the number of output files using conf.setNumReduceTasks(int).
>>>
>>> So my question is: what is the main advantage of setting the combiner
>>> class and the reducer class to the same reduce class?
>>
>> When the calculation performed by the reducer is commutative and
>> associative, with a combiner you get more work done before the shuffle,
>> less sorting and shuffling, and less work in the reducer. As in the word
>> count app: the mapper emits <"the", 1> a billion times, but with a
>> combiner equal to the reducer only <"the", 10^9> has to travel to the
>> reducer. If you couldn't use the combiner, not only would the shuffle
>> phase be as heavy as if you had a billion distinct words, but the poor
>> reducer that gets the "the" key would also be very slow. So you would
>> have to go through multiple mapreduce phases to aggregate the data
>> anyway.
>>
>>> How to merge the output files in this case?
>>
>> While I am not sure what you mean, there is no difference to you. The
>> output is the same.
>>
>>> And where to find any real example using different Combiner/Reducer
>>> classes to improve the map/reduce performance?
>>
>> If you want to compute an average, the combiner needs to do only sums; the
>> reducer does the sums and the final division. It would not be OK to divide in the