If I recall correctly the total number of counters you have is limited.
It's been a while since I looked at that code, but I seem to recall the
counters get pushed to JT in heartbeat messaging and are held in JT memory.
Anyway, 1) sounds like you'll hit limits, so I'd suggest starting with 2) .
Can you elaborate what will happen in the subsequent jobs? Is this
something you can take multiple passes on?
E.g. in Job 1, emit the keys from mappers as a compound of (ZipCode :
module100(zipcode)), so that at worst case you are dealing with a 100th of
a zipcode at a time in a reducer. You'd then pass the first job output
into the second to combine again and group the 100ths of the zipcode data
together. This would probably only work on some operations though.
Are you making use of Combiners? They effectively do what I say above but
do a mini-reduce as the output of each map.
Are you using Hive or vanilla MR? The hive folks are making a lot of
advances in this area (e.g. ORC files).
10s to 100s millions of keys does not sound that many though.
On Sun, Sep 1, 2013 at 7:49 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> I am solving a problem which will involve several chained Map-Reduce
> tasks. On the second task I want the mapper to send every value in a
> specific 'area' (imagine customers in a zip code) of the data space to a
> reducer. These need to be compared and, following an earlier suggestion,
> when the number of customers in a 'zipcode' gets large the reducer will
> perform poorly or not at all.
> As long as every mapper knows the total number of customers in every zip
> code they can make the same decision as to whether and how to break up a
> specific zip code.
> At the end of the first reduce pass I have processed every customer and
> should have been able to gather statistics before launching the next job.
> I see two options: First to use counters so I might say something like
> context.getCounter("Zipcodes", MyZipcode.toString()).increment(1);
> 1) The number of zipcodes might be 4000 or as high as 50,000 in which
> case I would need to raise the maximum number of counters
> 2) in addition to using my normal set of key-value pairs emit special keys
> for zipcodes with the key being zipcode (say with "_" prepended to
> distinguish it from other Text keys and the value being a count as
> WordCount does.
> Approach 1 maintains a large number of counters but at the end of my first
> job I will have the statistic2 and can pass then to the second job.
> Approach 2 doubles the number of key-values to sort (we are in the tens to
> hundreds of millions in the larger jobs) and adds processing code as well
> as the requirement to read and process result files.
> Any comments on the two approaches or ideas that I might not have thought