RE: When reduce function is used as combiner?
David Parks 2012-12-11, 11:32
The map task may use a combiner zero or more times. Basically that means (as
far as I understand) that if the map output data is below some internal Hadoop
threshold, it'll just send it to the reducer; if it's larger, it'll run it
through the combiner first. And at Hadoop's discretion, it may run the
combiner more than once on the same set of data if it deems it likely to be
useful (the algorithms which determine that are beyond my understanding).
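Because the number of combiner passes isn't under your control, the combine
step has to give the same final answer whether it runs zero, one, or several
times. Here is a minimal sketch of that property in plain Python (not the
Hadoop API); the names combine, reduce_all and run are made up for
illustration, and a simple per-key sum stands in for the real job.

```python
from collections import defaultdict

def combine(pairs):
    """Sum values per key; output has the same (key, int) shape as the input."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

def reduce_all(pairs):
    """Final reduce: applied once to whatever reaches the reducer."""
    return dict(combine(pairs))

# Hypothetical map output for a word-count style job.
map_output = [("a", 1), ("b", 1), ("a", 1), ("a", 1), ("b", 1)]

def run(combiner_passes):
    """Apply the combiner the given number of times, then reduce once."""
    data = list(map_output)
    for _ in range(combiner_passes):
        data = combine(data)
    return reduce_all(data)

# Same final totals regardless of how many combiner passes ran.
results = [run(n) for n in range(4)]
```

Summing works here because it is associative and commutative; an operation
without those properties would not be safe to use as a combiner.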
Your second question, "Is there any maximum size?": Hadoop is, as I
understand it, looking at the whole of the map output to determine whether it
should run the combiner, not at the individual keys/values.
"Values must be the same correct?": yes, your combiner's input and output
key/value types must match the mapper's output types. If that's different from
your reducer's output you'll need a separate combiner class, which may, other
than the output type, be the same business logic.
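To illustrate why a separate combiner is sometimes needed, here is a rough
sketch in plain Python (again not the Hadoop API) for a per-key mean. It
assumes the mapper emits (key, (total, count)) records; combine_mean and
reduce_mean are hypothetical names. The combiner keeps the (total, count)
shape, while the reducer shares the same accumulation but emits a float.

```python
from collections import defaultdict

def combine_mean(pairs):
    """Combiner: input and output are both (key, (total, count))."""
    acc = defaultdict(lambda: (0, 0))
    for key, (total, count) in pairs:
        t, c = acc[key]
        acc[key] = (t + total, c + count)
    return sorted(acc.items())

def reduce_mean(pairs):
    """Reducer: same accumulation, but the output type is (key, mean)."""
    return {key: total / count for key, (total, count) in combine_mean(pairs)}

# Hypothetical map output: one (value, 1) record per observation.
map_output = [("a", (10, 1)), ("a", (20, 1)), ("a", (60, 1))]

without_combiner = reduce_mean(map_output)

# Combine only the first two records, as might happen on a single spill.
partially_combined = combine_mean(map_output[:2]) + map_output[2:]
with_combiner = reduce_mean(partially_combined)

# Reusing the reducer as the combiner would average the averages instead.
naive_mean_of_means = (((10 + 20) / 2) + 60) / 2
```

Both paths through the combiner give a mean of 30.0 here, whereas averaging
the averages gives 37.5, which is why the reducer itself can't double as the
combiner in this case.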
Fourth question: The reduce phase will run only once; it's only the combiner
that may be run a variable number of times. The output of your reduce phase
goes straight to whatever filesystem you've defined for the output (e.g.
HDFS or S3, usually).
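A small simulation of that flow, in plain Python under simplifying assumptions
(one map task, a combiner that runs once per spill when enabled, and a
hypothetical run_job helper that counts calls), might look like this. It is a
sketch of the idea, not of Hadoop's actual spill and merge logic.

```python
from collections import defaultdict

def run_job(spills, use_combiner):
    """Return (final output, call counts) for a per-key sum job."""
    calls = {"combine": 0, "reduce": 0}
    shuffled = []
    for spill in spills:
        if use_combiner:
            # Optional map-side pass: pre-aggregate one spill.
            calls["combine"] += 1
            totals = defaultdict(int)
            for key, value in spill:
                totals[key] += value
            shuffled.extend(totals.items())
        else:
            shuffled.extend(spill)
    # Shuffle: group every value that reached the reducer by key.
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    # Reduce: exactly one call per key; its result is the final output.
    output = {}
    for key in sorted(grouped):
        calls["reduce"] += 1
        output[key] = sum(grouped[key])
    return output, calls

# Hypothetical spills from a single map task.
spills = [[("a", 1), ("b", 1), ("a", 1)], [("a", 1), ("b", 1)]]
with_combiner = run_job(spills, True)
without_combiner = run_job(spills, False)
```

In both runs the reduce step is called once per key and produces the same
totals; only the number of combine calls differs.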
From: Majid Azimi [mailto:[EMAIL PROTECTED]]
Sent: Friday, December 07, 2012 9:02 PM
To: [EMAIL PROTECTED]
Subject: When reduce function is used as combiner?
When is the reduce function used as a combiner? Is it used as a combiner when
the iterable passed to the reduce function is large?
Is there any maximum size for that iterable? I mean, for example, if that
iterable's size is more than 1000, will the reduce function be called more
than once for that key?
Another question: when the reduce function is used as a combiner, the input
key/value and output key/value types must be the same, correct? If they are
different, what will happen? Is an exception thrown at runtime?
Fourth question: let's say the iterable is very large, so Hadoop adds the
output of reduce to the iterable and passes it to reduce again along with the
other values that have not been processed. How does Hadoop know at what point
the output of the reduce function should be written to HDFS as real output?
When there are no more values to put into that iterable?