Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - How to handle imbalanced data in hadoop ?


Copy link to this message
-
Re: How to handle imbalanced data in hadoop ?
Jeff Hammerbacher 2009-11-16, 00:25
Hey Jeff,

You may be interested in the Skewed Design specification from the Pig team:
http://wiki.apache.org/pig/PigSkewedJoinSpec.

Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> wrote:

> My first thought is that it depends on the reduce logic. If you could do
> the
> reduction in two passes then you could do an initial arbitrary partition
> for
> the majority key and bring the partitions together in a second reduction
> (or
> a map-side join). I would use a round robin strategy to assign the
> arbitrary
> partitions.
>
>
>
>
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > Today there's a problem about imbalanced data come out of mind .
> >
> > I'd like to know how hadoop handle this kind of data.  e.g. one key
> > dominates the map output, say 99%. So 99% data set will go to one
> reducer,
> > and this reducer will become the bottleneck.
> >
> > Does hadoop have any other better ways to handle such imbalanced data set
> ?
> >
> >
> > Jeff Zhang
> >
>