Say my mappers produce at most (or precisely) 4 output keys, and say I configure the job with at least (or precisely) 4 reducers. I have noticed there is no guarantee that all four reducers will be used, one per key. Rather, it is entirely likely that one reducer won't be used at all while another receives two keys, first all the values of one key and then all the values of the other.
This has horrible implications for parallel performance, of course. If the keys carry roughly equal amounts of data, it effectively doubles the theoretically optimal reduce-phase time.
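To illustrate what I think is going on, here is a minimal sketch. It assumes the default partitioner picks a reducer by taking the key's hash code modulo the number of reducers, which is my understanding of HashPartitioner. The four keys are hypothetical, and plain Java String hashes stand in for whatever hashCode() my real key type returns, so the exact collisions in a real job would differ.

```java
import java.util.*;

class PartitionSketch {
    // Assumed rule: clear the sign bit of the hash, then take it modulo the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Group keys by the reducer index they would be sent to.
    static Map<Integer, List<String>> assign(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            byReducer.computeIfAbsent(partitionFor(key, numReducers), r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        // Hypothetical keys; "a" and "e" hash to 97 and 101, both 1 modulo 4.
        List<String> keys = Arrays.asList("a", "e", "b", "c");
        System.out.println(assign(keys, 4)); // prints {1=[a, e], 2=[b], 3=[c]}
    }
}
```

With these keys, reducer 0 gets nothing and reducer 1 gets two keys, which is exactly the pattern I described.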
I have been told that the only way to achieve a more ideal distribution of work is to write my own partitioner. I'm willing to do that (we've done it before within our group on this project), but I don't want to do any unnecessary work. I'm mildly surprised that there isn't a configuration setting that will achieve my goal here. Was the advice I received correct? Can my goal only be achieved by writing a fresh partitioner from scratch?
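For reference, the core of the partitioner I have in mind would be tiny. This is a minimal sketch, assuming the set of keys is known ahead of time. The key names and the class are hypothetical, and in a real job this logic would sit inside the getPartition(key, value, numPartitions) method of a Partitioner subclass rather than in a standalone class.

```java
import java.util.*;

class FixedKeyPartitionSketch {
    // Hypothetical key set, listed in the order the reducers should be assigned.
    private static final List<String> KNOWN_KEYS = Arrays.asList("a", "e", "b", "c");

    // Send each known key to its own reducer; unknown keys fall back to hash-modulo.
    static int getPartition(String key, int numPartitions) {
        int index = KNOWN_KEYS.indexOf(key);
        if (index < 0) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
        return index % numPartitions;
    }

    public static void main(String[] args) {
        for (String key : KNOWN_KEYS) {
            System.out.println(key + " -> reducer " + getPartition(key, 4));
        }
    }
}
```

With 4 reducers, each of the four keys lands on a different reducer. What I'd like to know is whether something equivalent can be had through configuration alone.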
Keith Wiley keithwiley.com music.keithwiley.com
"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
-- Galileo Galilei