-Re: Efficiently partition broadly distributed keys
Niels Basjes 2011-03-10, 22:07
2011/3/10 Luca Aiello <[EMAIL PROTECTED]>:
> thanks for the quick response. So in your opinion there is nothing like a "hadoop embedded" tool to do this. This is what I suspected indeed.
The mapreduce model simply uses the <key> as the pivot of the
processing. In your application specific situation your looking for a
way to "break" that model in a smart way. So no, Hadoop doesn't do
> Since the <key, value> pairs in the initial dataset are randomly partitioned in the input files,
> I suppose that I can avoid the initial statistic step and do statistics "on the fly".
> For example, if I have 3 keys A,B,C with probability 0.7, 0.2, and 0.1, every mapper will receive,
> on average, a input composed by 70% of A keys, 20% of Bs and 10% of Cs
> and so it will know that the <key,value> to be chunked are de As.
If you know this in advance you can surely put that into the the
mapper. But that simply means you've done the statistics in advance in
I fail to see how you could do this "on the fly" without prior knowledge.
As a slightly smarter way of doing something like this; Have a look at
the terasort example.
There a sample of the data is used to estimate the distribution.
> This can introduce some errors but it should produce a output which is quite uniformly distributed.
> Thanks again!
> On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote:
>> If I understand your problem correctly you actually need some way of
>> knowing if you need to "chop" a large set with a specific key in to
>> In mapreduce the map only has information about a single key at a
>> time. So you need something extra.
>> One way of handling this is to start by doing a statistical run first
>> and use the output to determine which keys can be chopped.
>> So first do a MR to determine per key how many records and/or bytes
>> need to be handled.
>> In the reducer you have a "lower limit" that ensures the reducer only
>> outputs the keys that need to be chopped.
>> In the second pass you do your actual algorithm with one twist: In the
>> mapper you use the output of the first run to determine if the key
>> needs to be rewritten and in how many variants. You then randomly (!)
>> choose a variant.
>> Example of what i mean:
>> Assume you have 100 records with key A and 10000 records with key B.
>> You determine that you want key groups of an approximate maximum of
>> 1000 records.
>> The first pass outputs that key B must be split into 10 variants (=10000/1000).
>> Then the second MAP will transform a key B into one of these randomly:
>> B-1, B-2, B-3 ...
>> A record with key A will remain unchanged.
>> Each run will show different results.
>> Only works if the set of keys that needs to be chopped is small
>> enough so you can have it in memory in the call to the second map.
>> Niels Basjes
>> 2011/3/10 Luca Aiello <[EMAIL PROTECTED]>:
>>> Dear users,
>>> hope this is the right list to submit this one, otherwise I apologize.
>>> I'd like to have your opinion about a problem that I'm facing on MapReduce framework. I am writing my code in Java and running on a grid.
>>> I have a textual input structured in <key, value> pairs. My task is to make the cartesian product of all the values that have the same key.
>>> I can do so it simply using <key> as my map key, so that every value with the same key is put in the same reducer, where I can easily process them and obtain the cartesian.
>>> However, my keys are not uniformly distributed, the distribution is very broad. As a result of this, the output of my reducers will be very unbalanced and I will have many small files (some KB) and a bunch of huge files (tens of GB). A sub-optimal yet acceptable approximation of my task would be to make the cartesian product of smaller chunks of values for very frequent keys, so that the load is distributed evenly among reducers. I am wondering how can I do this in the most efficient/elegant way.
Met vriendelijke groeten,