Hadoop >> mail # user >> How to handle imbalanced data in hadoop ?


Jeff Zhang 2009-11-15, 04:03
brien colwell 2009-11-15, 22:00
Re: How to handle imbalanced data in hadoop ?
Hey Jeff,

You may be interested in the Skewed Join design specification from the Pig team:
http://wiki.apache.org/pig/PigSkewedJoinSpec

Regards,
Jeff

On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> wrote:

> My first thought is that it depends on the reduce logic. If you could do the
> reduction in two passes then you could do an initial arbitrary partition for
> the majority key and bring the partitions together in a second reduction (or
> a map-side join). I would use a round robin strategy to assign the arbitrary
> partitions.
>
>
>
>
> On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote:
>
> > Hi all,
> >
> > Today a problem with imbalanced data came to mind.
> >
> > I'd like to know how Hadoop handles this kind of data, e.g. when one key
> > dominates the map output, say 99%. Then 99% of the data set will go to one
> > reducer, and this reducer will become the bottleneck.
> >
> > Does Hadoop have any better way to handle such an imbalanced data set?
> >
> >
> > Jeff Zhang
> >
>
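For readers who want to see the two-pass idea from brien's reply in concrete form, here is a minimal sketch in plain Python. It is a simulation, not Hadoop code: the names salted_partition, reduce_by_key and two_pass_reduce are made up for illustration, and it assumes the reduce logic is associative and commutative (a sum, count, max and so on), which is the "depends on the reduce logic" caveat. The dominant key is spread round robin over a few arbitrary sub-partitions, each sub-partition is reduced on its own, and a second reduction brings the partial results together.

```python
from itertools import count


def salted_partition(records, hot_key, num_splits):
    """Pass 1, map side: tag each record with a salt.

    Records for hot_key get salts 0..num_splits-1 in round-robin order,
    so they land in num_splits separate reduce groups. All other keys
    keep a single group (salt 0).
    """
    round_robin = count()
    for key, value in records:
        if key == hot_key:
            yield (key, next(round_robin) % num_splits), value
        else:
            yield (key, 0), value


def reduce_by_key(pairs, combine):
    """Group pairs by key and fold the values of each group with combine."""
    grouped = {}
    for key, value in pairs:
        grouped[key] = combine(grouped[key], value) if key in grouped else value
    return grouped


def two_pass_reduce(records, hot_key, num_splits, combine):
    """Reduce in two passes; combine must be associative and commutative."""
    # Pass 1: reduce every (key, salt) group independently.
    partial = reduce_by_key(salted_partition(records, hot_key, num_splits), combine)
    # Pass 2: drop the salt and merge the partial results per original key.
    unsalted = ((key, value) for (key, _salt), value in partial.items())
    return reduce_by_key(unsalted, combine)


# Hypothetical input: key "a" makes up 99% of the map output.
records = [("a", 1)] * 99 + [("b", 1)]
totals = two_pass_reduce(records, "a", 4, lambda x, y: x + y)
print(totals)  # {'a': 99, 'b': 1}
```

With four splits, no first-pass group for "a" sees more than 25 of the 99 records, and the second pass only has to merge four partial sums. In a real job the salt would go into the map output key (or a custom Partitioner) and the second pass would be a follow-up job; those details are left out here.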
Pankil Doshi 2009-11-17, 21:54
Todd Lipcon 2009-11-17, 22:07
Amogh Vasekar 2009-11-18, 11:54
Pankil Doshi 2009-11-18, 19:16
Runping Qi 2009-11-18, 20:34
Pankil Doshi 2009-11-18, 22:53
Todd Lipcon 2009-11-24, 05:32
Todd Lipcon 2009-11-24, 07:14
Todd Lipcon 2009-11-18, 22:55
Ted Yu 2009-11-18, 00:05
Ted Xu 2009-11-24, 15:35
Jeff Zhang 2009-11-24, 16:09
Todd Lipcon 2009-11-24, 16:18
Pankil Doshi 2009-11-24, 18:35