-Re: Gathering Statistics for use by Mappers?
Steve Lewis 2013-09-02, 19:23
There is a max counters - 6000 I think bit it can be raised by setting
"mapreduce.job.counters.max - I have done this but don't know
the performance implications.
What I am doing is comparing all items with a specific key with each other.
This requires all elements to be held in memory. If the number of items
gets too large, the task may be split among multiple reducers but the
mapper needs this information as it starts assigning keys - hence I need
the statistics early. Keys with fewer values can be handled by a single
For the business logic combiners would probably be of little use. Even
for two records having the same key, say zipcode. the odds that they are
'next door neighbors' and could be combined are pretty low.
Yes, the number of keys is not high but every value is a pretty complex
serialized object with a lot of processing to handle it.
On Sun, Sep 1, 2013 at 1:02 PM, Tim Robertson <[EMAIL PROTECTED]>wrote:
> Hey Steve,
> If I recall correctly the total number of counters you have is limited.
> It's been a while since I looked at that code, but I seem to recall the
> counters get pushed to JT in heartbeat messaging and are held in JT memory.
> Anyway, 1) sounds like you'll hit limits, so I'd suggest starting with 2) .
> Can you elaborate what will happen in the subsequent jobs? Is this
> something you can take multiple passes on?
> E.g. in Job 1, emit the keys from mappers as a compound of (ZipCode :
> module100(zipcode)), so that at worst case you are dealing with a 100th of
> a zipcode at a time in a reducer. You'd then pass the first job output
> into the second to combine again and group the 100ths of the zipcode data
> together. This would probably only work on some operations though.
> Are you making use of Combiners? They effectively do what I say above but
> do a mini-reduce as the output of each map.
> Are you using Hive or vanilla MR? The hive folks are making a lot of
> advances in this area (e.g. ORC files).
> 10s to 100s millions of keys does not sound that many though.
> On Sun, Sep 1, 2013 at 7:49 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>> I am solving a problem which will involve several chained Map-Reduce
>> tasks. On the second task I want the mapper to send every value in a
>> specific 'area' (imagine customers in a zip code) of the data space to a
>> reducer. These need to be compared and, following an earlier suggestion,
>> when the number of customers in a 'zipcode' gets large the reducer will
>> perform poorly or not at all.
>> As long as every mapper knows the total number of customers in every
>> zip code they can make the same decision as to whether and how to break up
>> a specific zip code.
>> At the end of the first reduce pass I have processed every customer and
>> should have been able to gather statistics before launching the next job.
>> I see two options: First to use counters so I might say something like
>> context.getCounter("Zipcodes", MyZipcode.toString()).increment(1);
>> 1) The number of zipcodes might be 4000 or as high as 50,000 in which
>> case I would need to raise the maximum number of counters
>> 2) in addition to using my normal set of key-value pairs emit special
>> keys for zipcodes with the key being zipcode (say with "_" prepended to
>> distinguish it from other Text keys and the value being a count as
>> WordCount does.
>> Approach 1 maintains a large number of counters but at the end of my
>> first job I will have the statistic2 and can pass then to the second job.
>> Approach 2 doubles the number of key-values to sort (we are in the tens
>> to hundreds of millions in the larger jobs) and adds processing code as
>> well as the requirement to read and process result files.
>> Any comments on the two approaches or ideas that I might not have thought
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033