Re: conf.setCombinerClass in Map/Reduce
If the input to the reducer is of type <K2,V2>, the combiner would take in
<K2,V2> and emit <K2,V2>.
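
Concretely, a combiner is just a reducer whose output types match its input
types. A minimal sketch in the old mapred API (org.apache.hadoop.mapred, plus
java.util.Iterator and java.io.IOException; the class name and the
<Text, IntWritable> types are only illustrative):

public static class PassThrough extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  // A do-nothing combiner: legal precisely because the input types
  // <K2,V2> = <Text, IntWritable> equal the output types, since combiner
  // output is fed back into the shuffle as if it were map output.
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}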

On Tue, Oct 5, 2010 at 10:03 PM, Shi Yu <[EMAIL PROTECTED]> wrote:

> Hi, thanks for the answer, Antonio.
>
> I have found one of the main problems. It was because I used
> MultipleOutputs in the Reduce class: when I set the same class as both the
> Combiner and the Reducer, the Combiner did not provide the normal data flow
> to the Reducer. Therefore, the program stopped at the Combiner and no
> Reducer actually ran. To solve this, I have to use both outputs:
>
> OutputCollector collector =
>     multipleOutputs.getCollector("stringlabel", keyText, reporter);
> collector.collect(keyText, value);
> output.collect(key, value);
>
> The collector generates the separate named output files, while
> output.collect makes sure the data still flows on to the Reducer. After
> this change, both the Combiner and the Reducer work.
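>
> Roughly, my reduce method now does this (a sketch; "stringlabel" is the
> named output configured with MultipleOutputs.addNamedOutput in the job
> setup):
>
> public void reduce(Text key, Iterator<Text> values,
>                    OutputCollector<Text, Text> output, Reporter reporter)
>     throws IOException {
>   while (values.hasNext()) {
>     Text value = values.next();
>     // named output: goes to the separate "stringlabel" files
>     multipleOutputs.getCollector("stringlabel", reporter).collect(key, value);
>     // standard output: when this class runs as the combiner, this is
>     // what flows on to the Reducer
>     output.collect(key, value);
>   }
> }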
>
> The remaining question: if I want to use both the Combiner and the Reducer,
> must the input and output of the Reduce class be the same <K2,V2>? If not,
> how should it be done? The use case seems very limited here: for example,
> what if the Reducer class is a little more complicated, having <K2,V2> as
> input and <K3,V3> as output?
>
> Thanks again.
>
> Shi
>
>
>
> On 2010-10-5 23:48, Antonio Piccolboni wrote:
>
>> On Tue, Oct 5, 2010 at 4:32 PM, Shi Yu <[EMAIL PROTECTED]> wrote:
>>
>>
>>
>>> Hi,
>>>
>>> I am still confused about the effect of using a Combiner in Hadoop
>>> Map/Reduce. The performance tips (
>>>
>>> http://www.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/
>>> )
>>> suggest writing a combiner to do initial aggregation before the data
>>> hits the reducer, for a performance advantage. But in most of the example
>>> code and books I have seen, the same reduce class is set as both the
>>> reducer and the combiner, such as
>>>
>>> conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>>
>>> I don't know the specific reason for doing it like this. In my own code,
>>> based on Hadoop 0.19.2, if I set the combiner class to the reduce class
>>> while using MultipleOutputs, the output files are named xxx-m-00000, and
>>> if there are multiple input paths, the number of output files is the
>>> same as the number of input paths. conf.setNumReduceTasks(int) then has
>>> no effect on the number of output files. I wonder where the
>>> reducer-generated outputs are in this case, because I cannot see them.
>>> To see the reducer output, I have to remove the combiner class
>>>
>>> //conf.setCombinerClass(Reduce.class);
>>>
>>> conf.setReducerClass(Reduce.class);
>>>
>>>
>>> and then I get output files named xxx-r-00000, and I can control the
>>> number of output files using conf.setNumReduceTasks(int).
>>>
>>> So my question is: what is the main advantage of setting the combiner
>>> class and the reducer class to the same reduce class?
>>>
>>>
>>
>> When the calculation performed by the reducer is commutative and
>> associative, a combiner gets more work done before the shuffle, with less
>> sorting and shuffling and less work in the reducer. As in the word count
>> app: the mapper emits <"the", 1> a billion times, but with a combiner
>> equal to the reducer, only <"the", 10^9> has to travel to the reducer. If
>> you couldn't use the combiner, not only would the shuffle phase be as
>> heavy as if you had a billion distinct words, but the poor reducer that
>> gets the "the" key would also be very slow. So you would have to go
>> through multiple mapreduce phases to aggregate the data anyway.
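>>
>> A sketch of the word count case in the old mapred API (the class name is
>> illustrative; the sum logic is the standard pattern):
>>
>> public static class Sum extends MapReduceBase
>>     implements Reducer<Text, IntWritable, Text, IntWritable> {
>>   public void reduce(Text key, Iterator<IntWritable> values,
>>                      OutputCollector<Text, IntWritable> output,
>>                      Reporter reporter) throws IOException {
>>     int sum = 0;
>>     while (values.hasNext()) sum += values.next().get();
>>     // a partial sum on the map side, the final sum after the shuffle
>>     output.collect(key, new IntWritable(sum));
>>   }
>> }
>>
>> // Safe to reuse because addition is commutative and associative:
>> conf.setCombinerClass(Sum.class);
>> conf.setReducerClass(Sum.class);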
>>
>>
>>
>>
>>
>>> How to merge the output files in this case?
>>>
>>>
>>
>> While I am not sure what you mean, there is no difference to you: the
>> output is the same either way.
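>>
>> (If you ever do need a single file, one option is to concatenate the part
>> files after the job, e.g. hadoop fs -getmerge <output-dir> <local-file>.)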
>>
>>
>>
>>
>>
>>> And where can I find a real example that uses different Combiner/Reducer
>>> classes to improve map/reduce performance?
>>>
>>>
>>>
>> If you want to compute an average, the combiner needs to do only sums;
>> the reducer does the sums and the final division. It would not be OK to
>> divide in the combiner.
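>>
>> A sketch of that split (hypothetical classes; partial results are encoded
>> as "sum,count" Text pairs so the combiner's input and output types match,
>> with the mapper emitting <key, "value,1">):
>>
>> public static class PartialSum extends MapReduceBase
>>     implements Reducer<Text, Text, Text, Text> {
>>   public void reduce(Text key, Iterator<Text> values,
>>                      OutputCollector<Text, Text> output, Reporter reporter)
>>       throws IOException {
>>     double sum = 0; long count = 0;
>>     while (values.hasNext()) {
>>       String[] p = values.next().toString().split(",");
>>       sum += Double.parseDouble(p[0]);
>>       count += Long.parseLong(p[1]);
>>     }
>>     output.collect(key, new Text(sum + "," + count)); // no division here
>>   }
>> }
>>
>> public static class Average extends MapReduceBase
>>     implements Reducer<Text, Text, Text, DoubleWritable> {
>>   public void reduce(Text key, Iterator<Text> values,
>>                      OutputCollector<Text, DoubleWritable> output,
>>                      Reporter reporter) throws IOException {
>>     double sum = 0; long count = 0;
>>     while (values.hasNext()) {
>>       String[] p = values.next().toString().split(",");
>>       sum += Double.parseDouble(p[0]);
>>       count += Long.parseLong(p[1]);
>>     }
>>     output.collect(key, new DoubleWritable(sum / count)); // divide once
>>   }
>> }
>>
>> conf.setCombinerClass(PartialSum.class);
>> conf.setReducerClass(Average.class);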