Hadoop >> mail # user >> How to handle imbalanced data in hadoop ?


Re: How to handle imbalanced data in hadoop ?
Hey Jeff,

You may be interested in the Skewed Join design specification from the Pig team:
http://wiki.apache.org/pig/PigSkewedJoinSpec

Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> wrote:

> My first thought is that it depends on the reduce logic. If you could do the
> reduction in two passes, then you could do an initial arbitrary partition for
> the majority key and bring the partitions together in a second reduction (or
> a map-side join). I would use a round-robin strategy to assign the arbitrary
> partitions.
>
>
>
>
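
The two-pass idea quoted above can be sketched in plain Python, outside of
Hadoop, as a rough illustration rather than anyone's actual job code. It
assumes the reduce logic is an associative, commutative sum (the "depends on
the reduce logic" caveat) and that the dominant key is known in advance. The
names HOT_KEYS, NUM_SALTS, salt_keys, reduce_sum and two_pass_sum are all
hypothetical.

```python
from collections import defaultdict
from itertools import count

NUM_SALTS = 4        # hypothetical number of sub-partitions for the hot key
HOT_KEYS = {"hot"}   # assumed to be known in advance, e.g. from sampling


def salt_keys(records, hot_keys=HOT_KEYS, num_salts=NUM_SALTS):
    """Map side: attach a round-robin salt to every record of a hot key."""
    counter = count()
    for key, value in records:
        if key in hot_keys:
            yield (key, next(counter) % num_salts), value
        else:
            yield (key, 0), value  # other keys keep a single partition


def reduce_sum(pairs):
    """Reducer stand-in: sum the values seen for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def two_pass_sum(records):
    # Pass 1: reduce on (key, salt), so the hot key is spread over NUM_SALTS groups.
    partial = reduce_sum(salt_keys(records))
    # Pass 2: drop the salt and combine the partial results per original key.
    merged = reduce_sum((key, value) for (key, _salt), value in partial.items())
    return partial, merged


records = [("hot", 1)] * 99 + [("cold", 1)]
partial, merged = two_pass_sum(records)
print(partial)
print(merged)
```

In the first pass no group receives more than about a quarter of the hot
key's records, and the second pass only has to combine NUM_SALTS small
partial results. This does not carry over to reduce logic that needs every
value for a key in one place.
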
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > Today a problem about imbalanced data came to mind.
> >
> > I'd like to know how Hadoop handles this kind of data, e.g. one key
> > dominates the map output, say 99%. Then 99% of the data set will go to one
> > reducer, and this reducer will become the bottleneck.
> >
> > Does Hadoop have any better ways to handle such an imbalanced data set?
> >
> >
> > Jeff Zhang
> >
>