

Re: Efficiently partition broadly distributed keys
If I understand your problem correctly you actually need some way of
knowing if you need to "chop" a large set with a specific key in to subsets. In mapreduce the map only has information about a single key at a time. So you need something extra. One way of handling this is to start by doing a statistical run first and use the output to determine which keys can be chopped. So first do a MR to determine per key how many records and/or bytes need to be handled. In the reducer you have a "lower limit" that ensures the reducer only outputs the keys that need to be chopped. In the second pass you do your actual algorithm with one twist: In the mapper you use the output of the first run to determine if the key needs to be rewritten and in how many variants. You then randomly (!) choose a variant. Example of what i mean: Assume you have 100 records with key A and 10000 records with key B. You determine that you want key groups of an approximate maximum of 1000 records. The first pass outputs that key B must be split into 10 variants (=10000/1000). Then the second MAP will transform a key B into one of these randomly: B1, B2, B3 ... A record with key A will remain unchanged. Disadvantage: Each run will show different results. Only works if the set of keys that needs to be chopped is small enough so you can have it in memory in the call to the second map. HTH Niels Basjes 2011/3/10 Luca Aiello <[EMAIL PROTECTED]>: > Dear users, > > hope this is the right list to submit this one, otherwise I apologize. > > I'd like to have your opinion about a problem that I'm facing on MapReduce framework. I am writing my code in Java and running on a grid. > > I have a textual input structured in <key, value> pairs. My task is to make the cartesian product of all the values that have the same key. > I can do so it simply using <key> as my map key, so that every value with the same key is put in the same reducer, where I can easily process them and obtain the cartesian. > > However, my keys are not uniformly distributed, the distribution is very broad. As a result of this, the output of my reducers will be very unbalanced and I will have many small files (some KB) and a bunch of huge files (tens of GB). A suboptimal yet acceptable approximation of my task would be to make the cartesian product of smaller chunks of values for very frequent keys, so that the load is distributed evenly among reducers. I am wondering how can I do this in the most efficient/elegant way. > > It appears to me that using a customized Partitioner is not the right way to act, since records with the same key have still to be mapped together (am I right?). > The only solution that comes into my mind is to split the key space artificially insider the mapper (e.g., for a frequent key "ABC", map the values on the reducers using keys like "ABC1", "ABC2", an so on). This would require an additional postprocessing cleanup phase to retrieve the original keys. > > Do you know a better, possibly automatic way to perform this task? > > Thank you! > Best > > Luca > >  Met vriendelijke groeten, Niels Basjes 
