MapReduce >> mail # user >> Distributing Keys across Reducers


Re: Distributing Keys across Reducers
Hi Dave,

I haven't actually done this in practice, so take this with a grain of
salt ;-)

One way to work around your problem might be to add entropy to the keys.
For example, if your keys are "a", "b", etc. and you have too many "a"s
and too many "b"s, you could inflate your keys randomly to (a, 1), ...,
(a, 100), (b, 1), ..., (b, 100), etc. and partition over those.
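
To make the idea concrete, here is a rough sketch in plain Python (not
Hadoop code, and untested against a real cluster). It simulates how many
records each reducer receives with and without inflated keys. The
partition() function is only a stand-in for a real partitioner, the key
counts are made up, and all names are hypothetical.

```python
import random
import zlib
from collections import Counter

NUM_REDUCERS = 43   # number of reduce tasks from the original question
NUM_SALTS = 100     # (a, 1) ... (a, 100), as in the text above

def partition(key, num_reducers=NUM_REDUCERS):
    """Stable hash partitioner; a stand-in for a real partitioner."""
    return zlib.crc32(repr(key).encode("utf-8")) % num_reducers

def inflate(key, rng, num_salts=NUM_SALTS):
    """Attach a random suffix so one hot key becomes many distinct keys."""
    return (key, rng.randint(1, num_salts))

def reducer_loads(keys, inflated, seed=0):
    """Count how many records land on each reducer."""
    rng = random.Random(seed)
    loads = Counter()
    for k in keys:
        loads[partition(inflate(k, rng) if inflated else k)] += 1
    return loads

# Made-up skewed input: "a" and "b" dominate.
keys = ["a"] * 50_000 + ["b"] * 30_000 + ["c"] * 1_000

plain = reducer_loads(keys, inflated=False)
salted = reducer_loads(keys, inflated=True)

print("largest reducer load without inflation:", max(plain.values()))
print("largest reducer load with inflation:   ", max(salted.values()))
```

Without inflation every "a" record hashes to the same reducer. With
inflation the same records are spread over up to 100 distinct keys.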

If you know the distribution of the key space beforehand, you could
inflate each key in such a way as to make the resulting distribution
uniform.
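
As a sketch of what that could look like (again plain Python, with
made-up key counts and a hypothetical helper name), you could pick the
number of suffixes per key in proportion to how often the key occurs, so
that every inflated key ends up with roughly the same number of rows:

```python
def salts_per_key(key_counts, target_rows_per_bucket):
    """Pick a fan-out per key so each inflated key holds about
    target_rows_per_bucket rows (integer ceiling division)."""
    return {
        key: max(1, -(-count // target_rows_per_bucket))
        for key, count in key_counts.items()
    }

# Hypothetical key frequencies, assumed to be known in advance.
key_counts = {"a": 350_000_000, "b": 100_000_000, "c": 60_000_000}

fanout = salts_per_key(key_counts, target_rows_per_bucket=10_000_000)

for key, n in fanout.items():
    print(key, "->", n, "suffixes, about", key_counts[key] // n, "rows each")
```

Rare keys keep a fan-out of 1, so only the hot keys pay the cost of the
extra merge step.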

The downside of this approach is that you need to collect the reducer
outputs for (a, 1) through (a, 100) and combine them to compute the value
for "a" (same for "b", etc. of course). Depending on what you compute,
this might be a simple operation or a second MapReduce job.
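
For the simple case, the combine step could look roughly like this
(plain Python, assuming the aggregation is a sum; the function names and
sample records are made up). It only works this easily for operations
that can be combined from partial results, such as sums, counts, or
maxima:

```python
from collections import defaultdict

def first_stage(salted_records):
    """Reduce over inflated keys: one partial sum per (key, suffix)."""
    partial = defaultdict(int)
    for (key, suffix), value in salted_records:
        partial[(key, suffix)] += value
    return partial

def second_stage(partial):
    """Drop the suffix and merge the partial sums per original key."""
    totals = defaultdict(int)
    for (key, _suffix), subtotal in partial.items():
        totals[key] += subtotal
    return dict(totals)

# Made-up records of the form ((key, suffix), value).
records = [(("a", 1), 2), (("a", 2), 3), (("a", 1), 5), (("b", 7), 1)]

totals = second_stage(first_stage(records))
print(totals)
```

Something like a median or a distinct count cannot be merged from
partial results this way, which is where a more involved second job
would come in.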

There's a blog post explaining this idea:

http://blog.rapleaf.com/dev/2010/03/08/dealing-with-skewed-key-sizes-in-cascading/

Regards,
Christoph

On 20.07.2012 15:20, Dave Shine wrote:
> I have a job that is emitting over 3 billion rows from the map to the
> reduce.  The job is configured with 43 reduce tasks.  A perfectly even
> distribution would amount to about 70 million rows per reduce task.
> However I actually got around 60 million for most of the tasks, one task
> got over 100 million, and one task got almost 350 million.  This uneven
> distribution caused the job to run exceedingly long.
>
> I believe this is referred to as a “key skew problem”, which I know is
> heavily dependent on the actual data being processed.  Can anyone point
> me to any blog posts, white papers, etc. that might give me some options
> on how to deal with this issue?