Steve Lewis
2012-01-24, 17:33
Sameer Farooqui
2012-01-24, 20:22
Raj V
2012-01-24, 22:34
Robert Evans
2012-01-25, 15:36
Raj V
2012-01-25, 15:46
-
When to use a combiner?
Steve Lewis 2012-01-24, 17:33
While working through a sample problem I used a combiner. I noticed that the Combiner output records were 90% of the Combiner input records, and when I looked at the data I found relatively few duplicated keys. This raises the question of what fraction of duplicate keys makes a combiner worthwhile. If every key is unique, I presume a combiner only wastes time and resources, especially when the data is large. But what fraction of duplicated keys is needed to justify a combiner?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
-
Re: When to use a combiner?
Sameer Farooqui 2012-01-24, 20:22
Hi Steve,
Yeah, you're right to suspect that a combiner may not be useful in your use case. It's mainly used to reduce network traffic between the mappers and the reducers. Hadoop may apply the combiner zero, one, or multiple times to the intermediate output from the mapper, so it's hard to accurately predict the CPU impact a combiner will have. The reduction in network traffic is a lot easier to predict and actually see.

From Chuck Lam's 'Hadoop in Action': "A combiner doesn't necessarily improve performance. You should monitor the job's behavior to see if the number of records outputted by the combiner is meaningfully less than the number of records going in. The reduction must justify the extra execution time of running a combiner. You can easily check this through the JobTracker's Web UI."

One thing to point out: don't assume the combiner is ineffective just because it isn't reducing the number of unique keys emitted from the map side. It really depends on your specific use case for the combiner and the nature of the MapReduce job. For example, imagine your map tasks find the maximum temperature for a given year (example from 'Hadoop: The Definitive Guide'), like so:

Node 1's Map output:
(1950, 20)
(1950, 10)
(1950, 40)

Node 2's Map output:
(1950, 0)
(1950, 15)

The reduce function would get this input after the shuffle phase:
(1950, [0, 10, 15, 20, 40])
and the reduce function would output:
(1950, 40)

But if you used a combiner, the reduce function would get smaller input to work with after the shuffle phase:
(1950, [40, 15])
and the output from Reduce would be the same.

There are specific use cases like the one above where a combiner gives dramatic performance gains, but it shouldn't be used by default 100% of the time.

Both of the books I mentioned are excellent, with tons of real-world tips, so I highly recommend them.
--
Sameer Farooqui
Systems Architect / HortonWorks
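For anyone who wants to see the max-temperature example above run, here is a minimal sketch in plain Java. It only simulates the data flow and uses no Hadoop classes; the class name `MaxCombinerSketch` and its helper methods are made up for illustration, and the "shuffle" is just a list concatenation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java simulation of the max-temperature example; no Hadoop classes are used.
class MaxCombinerSketch {

    // The reduce function: maximum of all values seen for one key.
    static int max(List<Integer> values) {
        return Collections.max(values);
    }

    // Values the reducer receives when no combiner runs: every map output is shuffled.
    static List<Integer> shuffleWithoutCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> shuffled = new ArrayList<>();
        for (List<Integer> nodeOutput : mapOutputs) {
            shuffled.addAll(nodeOutput);
        }
        return shuffled;
    }

    // Values the reducer receives when each node first applies max() as a combiner.
    static List<Integer> shuffleWithCombiner(List<List<Integer>> mapOutputs) {
        List<Integer> shuffled = new ArrayList<>();
        for (List<Integer> nodeOutput : mapOutputs) {
            shuffled.add(max(nodeOutput));
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(20, 10, 40),   // node 1, key 1950
                Arrays.asList(0, 15));       // node 2, key 1950

        List<Integer> plain = shuffleWithoutCombiner(mapOutputs);
        List<Integer> combined = shuffleWithCombiner(mapOutputs);

        System.out.println("reducer input without combiner: " + plain);
        System.out.println("reducer input with combiner:    " + combined);
        System.out.println("reduce output: " + max(plain) + " vs " + max(combined));
    }
}
```

With the data from the post, the reducer sees two values instead of five and still returns 40. The number of distinct keys is unchanged (one key, 1950); what shrinks is the number of records per key.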
-
Re: When to use a combiner?
Raj V 2012-01-24, 22:34
Just to add to Sameer's response: you cannot use a combiner if you are computing the average temperature. The combiner running on each mapper would produce the average of that mapper's output, and the reducer would then average the combiner outputs, giving an average of averages rather than the true average.
You can use a combiner if your reducer function R is like this:

R(S) = R(R(s1), R(s2), ..., R(sn))

where S is the whole set and s1, s2, ..., sn are some arbitrary partition of the set S.

Raj
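To make the average-of-averages problem concrete, here is a small plain-Java sketch that reuses the temperatures from Sameer's example, split across the same two nodes. The class and method names (`AverageOfAveragesSketch`, `mean`, `meanOfMeans`) are made up for illustration and no Hadoop classes are involved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Shows that mean() does not satisfy R(S) = R(R(s1), ..., R(sn)) for unequal partitions.
class AverageOfAveragesSketch {

    // R applied to one set of values: the arithmetic mean.
    static double mean(List<Double> values) {
        double sum = 0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.size();
    }

    // R(R(s1), ..., R(sn)): average each partition, then average those results.
    static double meanOfMeans(List<List<Double>> partitions) {
        List<Double> partialMeans = new ArrayList<>();
        for (List<Double> partition : partitions) {
            partialMeans.add(mean(partition));
        }
        return mean(partialMeans);
    }

    public static void main(String[] args) {
        List<Double> all = Arrays.asList(20.0, 10.0, 40.0, 0.0, 15.0);
        List<List<Double>> partitions = Arrays.asList(
                Arrays.asList(20.0, 10.0, 40.0),   // node 1
                Arrays.asList(0.0, 15.0));         // node 2

        System.out.println("true mean:     " + mean(all));
        System.out.println("mean of means: " + meanOfMeans(partitions));
    }
}
```

The true mean of the five values is 85 / 5 = 17, while the mean of the per-node means is (70/3 + 15/2) / 2, roughly 15.42. The two agree only when every partition has the same size, which a job cannot rely on.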
-
Re: When to use a combiner?
Robert Evans 2012-01-25, 15:36
You can use a combiner for average. You just have to write a separate combiner from your reducer.
// Key and Pair are placeholder types. Each value is a (sum, count) pair,
// so the mapper is assumed to emit (value, 1) for every record.
class myCombiner {
    // Merges partial (sum, count) pairs into a single pair of the same type.
    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        context.write(key, new Pair<Long, Long>(sum, count));
    }
}

class myReducer {
    // Merges the (sum, count) pairs and divides once to get the average.
    void reduce(Key key, Iterable<Pair<Long, Long>> values, Context context) {
        long sum = 0;
        long count = 0;
        for (Pair<Long, Long> value : values) {
            sum += value.first;
            count += value.second;
        }
        context.write(key, ((double) sum) / count);
    }
}

--Bobby Evans
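As a rough check of the sum/count idea, here is a plain-Java simulation with the temperatures from earlier in the thread. It is a sketch under assumptions rather than Hadoop code: `SumCount` is a hypothetical stand-in for `Pair<Long, Long>`, and `toPartials`, `combine`, and `reduce` are illustrative helpers, not Hadoop API calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the sum/count approach; no Hadoop classes are used.
class SumCountAverageSketch {

    // Hypothetical stand-in for the Pair<Long, Long> value: a partial (sum, count).
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }
    }

    // Map side: every temperature becomes a partial with count 1.
    static List<SumCount> toPartials(List<Long> temperatures) {
        List<SumCount> partials = new ArrayList<>();
        for (long t : temperatures) {
            partials.add(new SumCount(t, 1));
        }
        return partials;
    }

    // Combiner: merges partials into one partial. Input and output have the
    // same type, so running it zero, one, or several times is safe.
    static SumCount combine(List<SumCount> partials) {
        long sum = 0;
        long count = 0;
        for (SumCount p : partials) {
            sum += p.sum;
            count += p.count;
        }
        return new SumCount(sum, count);
    }

    // Reducer: merges whatever partials arrive, then divides once at the end.
    static double reduce(List<SumCount> partials) {
        SumCount total = combine(partials);
        return (double) total.sum / total.count;
    }

    public static void main(String[] args) {
        List<SumCount> node1 = toPartials(Arrays.asList(20L, 10L, 40L));
        List<SumCount> node2 = toPartials(Arrays.asList(0L, 15L));

        // No combiner: the reducer sees all five (value, 1) partials.
        List<SumCount> all = new ArrayList<>(node1);
        all.addAll(node2);

        // Combiner ran once per node: the reducer sees two partials.
        List<SumCount> combined = Arrays.asList(combine(node1), combine(node2));

        System.out.println("average without combiner: " + reduce(all));
        System.out.println("average with combiner:    " + reduce(combined));
    }
}
```

Both paths give 17.0, because the division happens only in the reducer. In a real job the combiner's input and output types both have to match the map output types, which is why the mapper emits (value, 1) pairs instead of bare values.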
-
Re: When to use a combiner?
Raj V 2012-01-25, 15:46
Touché!
Raj