|
|
Mathias Herberts 2011-10-29, 10:52
Hi,
I'm designing a 'Hadoop MapReduce Poster', putting all pieces together so people will easily be able to visualize the full M/R flow.
Concerning the combiners, I have a few points I'd like to have clarified.
If I'm not mistaken, the output of the Mapper is passed to the Partitioner which will dispatch K,V into R partitions.
<K,V> for each partition then go through the set SortComparatorClass for sorting.
If there is a combiner, the sorted output is grouped using the SortComparatorClass (and not the GroupingComparatorClass as it's the case in the Reducer) and passed to the combiner prior to be written to the partition file.
My question is, what happens if the combiner outputs different keys than what it is being fed? The output of the combiner will suffer two flaws:
1. It won't be sorted 2. It might end up in the wrong partition
Since a Combiner is simply a Reducer with no other constraints, nothing seems to prevent those 2 problems.
Is my understanding correct?
Mathias.
Owen O'Malley 2011-10-31, 03:22
On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts < [EMAIL PROTECTED]> wrote:
> My question is, what happens if the combiner outputs different keys > than what it is being fed? The output of the combiner will suffer two > flaws: > > 1. It won't be sorted > 2. It might end up in the wrong partition >
Yes. We've talked about adding various checks, but I don't think anyone has added them. We obviously have the input key and one option would be to ignore the output key. > Since a Combiner is simply a Reducer with no other constraints, >
That isn't true. Combiners are required to be: 1. Idempotent - The number of times the combiner is applied can't change the output 2. Transititive - The order of the inputs can't change the output 3. Side-effect free - Combiners can't have side effects (or they won't be idempotent). 4. Preserve the sort order - They can't change the keys to disrupt the sort order 5. Preserve the partitioning - They can't change the keys to change the parititioning
All 5 of them are required for combiners.
-- Owen
Mathias Herberts 2011-10-31, 12:41
> Yes. We've talked about adding various checks, but I don't think anyone has > added them. We obviously have the input key and one option would be to > ignore the output key.
ok.
> >> Since a Combiner is simply a Reducer with no other constraints, >> > > That isn't true. Combiners are required to be:
When I meant 'Reducer with no other constraints' I meant Job.setCombinerClass takes a Class<? extends Reducer>, not a Combiner class which enforces stricter constraints.
Thanks for listing the 5 requirements, if you don't mind I'll add them to the Hadoop MapReducer Poster.
Owen O'Malley 2011-10-31, 15:15
On Mon, Oct 31, 2011 at 5:41 AM, Mathias Herberts < [EMAIL PROTECTED]> wrote:
> Thanks for listing the 5 requirements, if you don't mind I'll add them > to the Hadoop MapReducer Poster. >
Sure.
On 30 October 2011 20:22, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Sat, Oct 29, 2011 at 3:52 AM, Mathias Herberts < > [EMAIL PROTECTED]> wrote: > > > My question is, what happens if the combiner outputs different keys > > than what it is being fed? The output of the combiner will suffer two > > flaws: > > > > 1. It won't be sorted > > 2. It might end up in the wrong partition > > > > Yes. We've talked about adding various checks, but I don't think anyone has > added them. We obviously have the input key and one option would be to > ignore the output key. > > > > Since a Combiner is simply a Reducer with no other constraints, > > > > That isn't true. Combiners are required to be: > 1. Idempotent - The number of times the combiner is applied can't change > the output > 2. Transititive - The order of the inputs can't change the output > 3. Side-effect free - Combiners can't have side effects (or they won't be > idempotent). > 4. Preserve the sort order - They can't change the keys to disrupt the > sort order > 5. Preserve the partitioning - They can't change the keys to change the > parititioning > > All 5 of them are required for combiners. >
I can't add to Owen's reply about Hadoop, other than perhaps to suggest "associative" might be the word?
However, I also mess with these things at the theoretical level for my own entertainment. This same question caught me out at first, because mathematically speaking, the ability to change the keys in the combiner does not prevent an efficient implementation from being constructed, but would make the output buffer management somewhat more complex. So if the restriction seems nonobvious, that's why.
Also, if you're explaining pure mapreduce, it also works without sorting (assuming the user application doesn't require sorted output), all you need is a gather operator, but (skipping tricks like bucket sort) since the most efficient way to gather is to use a comparison-sort, one might as well.
Back to the grindstone.
S.
|
|