-Re: How to do aggregate operations with multiple reducers
Adam Kawa 2013-11-30, 16:37
> I know I could just write a script that examines the output from the
> reducers and picks out the value with the highest temperature.
In general, I think that this is an acceptable solution.
However, if you have a very large number of reducer (let's say, large
hundreds or thousands), then reading the content of thousands of files from
HDFS can be slow. I would slightly optimize it by using MutlipleOutputs
and configure it in such a way, that each reducer is writing empty output
to a file named max-temp_reducer-id.part. Then the last file in your output
directory is the one that contains the maximal temperature in first part of
filename (alternatively, you can use a name convention
<MAX-LONG-max-temp>_reducer-id.part, so that the file with the maximal temp
will be the first one in your output directory -> might it be faster?).
Obviously to minimize the number of reduce tasks, you can use Combiner
class in this case, so that reducers will have less data to process.
An alternative way that I can imagine: If you have a couple of reducers
only, each of them can publish maximal temperature as a counter, and when
the job finishes, you can get counters from Job object and find the one
with the highest temperature. Please not that counters are "expensive" (at
least in MRv1), so that you should not have many of them (up to tens, but
rather not hundreds