I am solving a problem which will involve several chained Map-Reduce tasks.
On the second task I want the mapper to send every value in a specific
'area' (imagine customers in a zip code) of the data space to a reducer.
These need to be compared and, as an earlier suggestion pointed out, when the
number of customers in a 'zipcode' gets large the reducer will perform
poorly or not at all.
As long as every mapper knows the total number of customers in every zip
code they can make the same decision as to whether and how to break up a
specific zip code.
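A minimal sketch of that decision, written Streaming-style in Python. The per-reducer cap (`max_per_reducer`) and the hash-based shard assignment are illustrative assumptions, not part of the question:

```python
import hashlib

def shards_for_zip(zip_count, max_per_reducer):
    """Number of pieces to break a zipcode into, given its total count."""
    return max(1, -(-zip_count // max_per_reducer))  # ceiling division

def shard_key(zipcode, customer_id, zip_count, max_per_reducer):
    """Composite key; every mapper computes the same split because the
    zip counts and the hash are the same everywhere."""
    n = shards_for_zip(zip_count, max_per_reducer)
    if n == 1:
        return zipcode
    # Stable hash so the same customer always lands in the same shard.
    h = int(hashlib.md5(customer_id.encode()).hexdigest(), 16)
    return "%s#%d" % (zipcode, h % n)
```

Because the assignment depends only on the shared statistics and a stable hash, no coordination between mappers is needed.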
At the end of the first reduce pass I have processed every customer and
should be able to gather these statistics before launching the next job.
I see two options:

1) Use counters, one per zipcode. The number of zipcodes might be 4,000 but
could be as high as 50,000, in which case I would need to raise the
maximum number of counters.

2) In addition to emitting my normal set of key-value pairs, emit special
keys for zipcodes, with the key being the zipcode (say with "_" prepended
to distinguish it from other Text keys) and the value being a count.
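A Streaming-style sketch of option 2's reducer body. The tab-separated output and the `emit` callback are assumptions for illustration:

```python
def emit_with_stats(zipcode, customers, emit):
    """First-pass reducer body: emit the normal records, plus one special
    '_'-prefixed record carrying the customer count for this zipcode."""
    count = 0
    for cust in customers:
        emit(zipcode, cust)          # normal key-value pair
        count += 1
    emit("_" + zipcode, str(count))  # special statistics record
```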
Approach 1 maintains a large number of counters, but at the end of my first
job I will have the statistics and can pass them to the second job.
Approach 2 doubles the number of key-value pairs to sort (we are in the tens
to hundreds of millions in the larger jobs) and adds processing code, as
well as the requirement to read and process the result files.
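The "read and process result files" step for Approach 2 could look like the following; Hadoop's default tab-separated text output format is assumed:

```python
def load_zip_counts(lines):
    """Parse first-job output lines, keeping only the '_'-prefixed
    statistics records and stripping the prefix from the zipcode."""
    counts = {}
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key.startswith("_"):
            counts[key[1:]] = int(value)
    return counts
```

The resulting dictionary would then be distributed to every mapper of the second job (e.g. via the job configuration or the distributed cache).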
Any comments on the two approaches, or ideas that I might not have thought
of?