

Shi Yu 2010-10-05, 23:32
Antonio Piccolboni 2010-10-06, 04:48
Shi Yu 2010-10-06, 05:03
Re: conf.setCombinerClass in Map/Reduce
If the input to the reducer is <K2,V2>, the combiner takes <K2,V2> as input
and emits <K2,V2>.
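
To make that concrete, here is a minimal sketch (old mapred API as in 0.19;
the class and type names are made up for illustration): the combiner's input
and output types both match the map output <K2,V2>, while the reducer is free
to emit a different <K3,V3>.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Combiner: <Text, IntWritable> in, <Text, IntWritable> out, i.e. <K2,V2>
// in and <K2,V2> out, so its output can still feed the reducer.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) sum += values.next().get();
    output.collect(key, new IntWritable(sum));
  }
}

// Reducer: takes the same <K2,V2> but may emit a different <K3,V3>.
public class FormatReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, Text> {
  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) sum += values.next().get();
    output.collect(key, new Text("total=" + sum));
  }
}

In the driver you would then set conf.setCombinerClass(SumCombiner.class) and
conf.setReducerClass(FormatReducer.class), and declare the intermediate types
with conf.setMapOutputKeyClass()/setMapOutputValueClass(), since they differ
from the final output types.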

On Tue, Oct 5, 2010 at 10:03 PM, Shi Yu <[EMAIL PROTECTED]> wrote:

> Hi, thanks for the answer, Antonio.
>
> I have found one of the main problems. It was because I used
> MultipleOutputs in the Reduce class, so when I set both the Combiner and
> the Reducer, the Combiner did not provide the normal data flow to the
> Reducer. Therefore, the program stalled at the Combiner and no Reducer
> actually ran. To solve this, I have to write to both outputs:
>
> OutputCollector collector = multipleOutputs.getCollector("stringlabel", keyText, reporter);
> collector.collect(keyText, value);
> output.collect(key, value);
>
> The collector generates the separate output files, while output.collect()
> makes sure the data still flows on to the Reducer. After this change, both
> the Combiner and the Reducer work.
>
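
For illustration, the whole pattern might look roughly like this. This is a
sketch, not the original code: it assumes "stringlabel" was registered in the
driver with MultipleOutputs.addMultiNamedOutput, and note that the multi-name
argument of getCollector() is a String in this API.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// Driver-side registration (assumed):
// MultipleOutputs.addMultiNamedOutput(conf, "stringlabel",
//     TextOutputFormat.class, Text.class, Text.class);

public class MyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private MultipleOutputs multipleOutputs;

  public void configure(JobConf conf) {
    multipleOutputs = new MultipleOutputs(conf);
  }

  @SuppressWarnings("unchecked")
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      // Side channel: writes the separately named output files.
      OutputCollector collector =
          multipleOutputs.getCollector("stringlabel", key.toString(), reporter);
      collector.collect(key, value);
      // Standard channel: keeps the normal data flow to the part files.
      output.collect(key, value);
    }
  }

  public void close() throws IOException {
    multipleOutputs.close(); // flushes the named outputs
  }
}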
> The remaining question: if I want to use both the Combiner and the
> Reducer, must the input and output of the Reduce class be the same
> <K2,V2>? If not, how is it done? Otherwise the use case seems very
> limited, for example when the Reducer class is a little more complicated,
> taking <K2,V2> as input and emitting <K3,V3>.
>
> Thanks again.
>
> Shi
>
>
>
> On 2010-10-5 23:48, Antonio Piccolboni wrote:
>
>> On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I am still confused about the effect of using a Combiner in Hadoop
>>> Map/Reduce. The performance tips
>>> (http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/)
>>> suggest writing a combiner to do initial aggregation before the data
>>> hits the reducer, for a performance advantage. But in most of the
>>> example code and books I have seen, the same reduce class is set as both
>>> the reducer and the combiner, such as
>>>
>>> conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>>
>>> I don't know what the specific reason for doing this is. In my own
>>> code, based on Hadoop 0.19.2, if I set the combiner class to the reduce
>>> class while using MultipleOutputs, the output files are named
>>> xxx-m-00000, and if there are multiple input paths, the number of output
>>> files equals the number of input paths. conf.setNumReduceTasks(int) then
>>> no longer controls the number of output files. I wonder where the
>>> reducer-generated outputs are in this case, because I cannot see them.
>>> To see the reducer output, I have to remove the combiner class
>>>
>>> //conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>>
>>> and then the output files are named xxx-r-00000, and I can control the
>>> number of output files using conf.setNumReduceTasks(int).
>>>
>>> So my question is: what is the main advantage of setting the combiner
>>> class and the reducer class to the same reduce class?
>>
>> When the calculation performed by the reducer is commutative and
>> associative, with a combiner you get more work done before the shuffle,
>> less sorting and shuffling, and less work in the reducer. As in the word
>> count app: the mapper emits <"the", 1> a billion times, but with a
>> combiner equal to the reducer only <"the", 10^9> has to travel to the
>> reducer. If you couldn't use the combiner, not only would the shuffle
>> phase be as heavy as if you had a billion distinct words, but the poor
>> reducer that gets the "the" key would also be very slow. So you would
>> have to go through multiple mapreduce phases to aggregate the data
>> anyway.
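
For instance, the classic word count driver wires the same class into both
slots. A sketch with the usual hypothetical WordCount/Map/Reduce names (a
driver fragment, old mapred API):

JobConf conf = new JobConf(WordCount.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
// Addition is commutative and associative, so the same class can
// pre-sum <"the", 1> pairs on the map side and finish on the reduce side.
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);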
>>
>>> How to merge the output files in this case?
>>
>> While I am not sure what you mean, there is no difference to you. The
>> output
>> is the same.
>>
>>> And where can I find a real example that uses different
>>> Combiner/Reducer classes to improve map/reduce performance?
>>
>> If you want to compute an average, the combiner needs to do only sums;
>> the reducer does the sums and the final division. It would not be OK to
>> divide in the combiner.