|
Jeff Zhang
2009-11-15, 04:03
brien colwell
2009-11-15, 22:00
Jeff Hammerbacher
2009-11-16, 00:25
Pankil Doshi
2009-11-17, 21:54
Todd Lipcon
2009-11-17, 22:07
Ted Yu
2009-11-18, 00:05
Amogh Vasekar
2009-11-18, 11:54
Pankil Doshi
2009-11-18, 19:16
Runping Qi
2009-11-18, 20:34
Pankil Doshi
2009-11-18, 22:53
Todd Lipcon
2009-11-18, 22:55
Todd Lipcon
2009-11-24, 05:32
Todd Lipcon
2009-11-24, 07:14
Ted Xu
2009-11-24, 15:35
Jeff Zhang
2009-11-24, 16:09
Todd Lipcon
2009-11-24, 16:18
Pankil Doshi
2009-11-24, 18:35
|
-
How to handle imbalanced data in hadoop ?Jeff Zhang 2009-11-15, 04:03
Hi all,
Today there's a problem about imbalanced data come out of mind . I'd like to know how hadoop handle this kind of data. e.g. one key dominates the map output, say 99%. So 99% data set will go to one reducer, and this reducer will become the bottleneck. Does hadoop have any other better ways to handle such imbalanced data set ? Jeff Zhang
-
Re: How to handle imbalanced data in hadoop ?brien colwell 2009-11-15, 22:00
My first thought is that it depends on the reduce logic. If you could do the
reduction in two passes then you could do an initial arbitrary partition for the majority key and bring the partitions together in a second reduction (or a map-side join). I would use a round robin strategy to assign the arbitrary partitions. On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > Hi all, > > Today there's a problem about imbalanced data come out of mind . > > I'd like to know how hadoop handle this kind of data. e.g. one key > dominates the map output, say 99%. So 99% data set will go to one reducer, > and this reducer will become the bottleneck. > > Does hadoop have any other better ways to handle such imbalanced data set ? > > > Jeff Zhang >
-
Re: How to handle imbalanced data in hadoop ?Jeff Hammerbacher 2009-11-16, 00:25
Hey Jeff,
You may be interested in the Skewed Design specification from the Pig team: http://wiki.apache.org/pig/PigSkewedJoinSpec. Regards, Jeff On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> wrote: > My first thought is that it depends on the reduce logic. If you could do > the > reduction in two passes then you could do an initial arbitrary partition > for > the majority key and bring the partitions together in a second reduction > (or > a map-side join). I would use a round robin strategy to assign the > arbitrary > partitions. > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > Hi all, > > > > Today there's a problem about imbalanced data come out of mind . > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > dominates the map output, say 99%. So 99% data set will go to one > reducer, > > and this reducer will become the bottleneck. > > > > Does hadoop have any other better ways to handle such imbalanced data set > ? > > > > > > Jeff Zhang > > >
-
Re: How to handle imbalanced data in hadoop ?Pankil Doshi 2009-11-17, 21:54
With respect to Imbalanced data, Can anyone guide me how sorting takes place
in Hadoop after Map phase. I did some experiments and found that if there are two reducers which have same number of keys to sort and one reducer has all the keys same and other have different keys then time taken by by the reducer having all keys same is terribly large then other one. Also I found that length on my Key doesnt matter in the time taken to sort it. I wanted some hints how sorting is done .. Pankil On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED]>wrote: > Hey Jeff, > > You may be interested in the Skewed Design specification from the Pig team: > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > Regards, > Jeff > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> wrote: > > > My first thought is that it depends on the reduce logic. If you could do > > the > > reduction in two passes then you could do an initial arbitrary partition > > for > > the majority key and bring the partitions together in a second reduction > > (or > > a map-side join). I would use a round robin strategy to assign the > > arbitrary > > partitions. > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > > > Hi all, > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > dominates the map output, say 99%. So 99% data set will go to one > > reducer, > > > and this reducer will become the bottleneck. > > > > > > Does hadoop have any other better ways to handle such imbalanced data > set > > ? > > > > > > > > > Jeff Zhang > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Todd Lipcon 2009-11-17, 22:07
On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote:
> With respect to Imbalanced data, Can anyone guide me how sorting takes > place > in Hadoop after Map phase. > > I did some experiments and found that if there are two reducers which have > same number of keys to sort and one reducer has all the keys same and other > have different keys then time taken by by the reducer having all keys same > is terribly large then other one. > > Hi Pankil, This is an interesting experiment you've done with results that I wouldn't quite expect. Do you have the java source available that you used to run this experiment? > Also I found that length on my Key doesnt matter in the time taken to sort > it. > > With small keys on CPU-bound workload this is probably the case since the sort would be dominated by comparison. If you were to benchmark keys that are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference. > I wanted some hints how sorting is done .. > MapTask.java, ReduceTask.java, and Merger.java are the key places to look. The actual sort is a relatively basic quicksort, but there is plenty of complexity in the spill/shuffle/merge logic. -Todd > > Pankil > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > >wrote: > > > Hey Jeff, > > > > You may be interested in the Skewed Design specification from the Pig > team: > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > Regards, > > Jeff > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > wrote: > > > > > My first thought is that it depends on the reduce logic. If you could > do > > > the > > > reduction in two passes then you could do an initial arbitrary > partition > > > for > > > the majority key and bring the partitions together in a second > reduction > > > (or > > > a map-side join). I would use a round robin strategy to assign the > > > arbitrary > > > partitions. > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > > > > > Hi all, > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > dominates the map output, say 99%. So 99% data set will go to one > > > reducer, > > > > and this reducer will become the bottleneck. > > > > > > > > Does hadoop have any other better ways to handle such imbalanced data > > set > > > ? > > > > > > > > > > > > Jeff Zhang > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Ted Yu 2009-11-18, 00:05
Can someone fix the typo on http://wiki.apache.org/pig/PigSkewedJoinSpec in
the first bullet ? tow-table inner join Thanks On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > With respect to Imbalanced data, Can anyone guide me how sorting takes > place > in Hadoop after Map phase. > > I did some experiments and found that if there are two reducers which have > same number of keys to sort and one reducer has all the keys same and other > have different keys then time taken by by the reducer having all keys same > is terribly large then other one. > > Also I found that length on my Key doesnt matter in the time taken to sort > it. > > I wanted some hints how sorting is done .. > > Pankil > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > >wrote: > > > Hey Jeff, > > > > You may be interested in the Skewed Design specification from the Pig > team: > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > Regards, > > Jeff > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > wrote: > > > > > My first thought is that it depends on the reduce logic. If you could > do > > > the > > > reduction in two passes then you could do an initial arbitrary > partition > > > for > > > the majority key and bring the partitions together in a second > reduction > > > (or > > > a map-side join). I would use a round robin strategy to assign the > > > arbitrary > > > partitions. > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > > > > > Hi all, > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > dominates the map output, say 99%. So 99% data set will go to one > > > reducer, > > > > and this reducer will become the bottleneck. > > > > > > > > Does hadoop have any other better ways to handle such imbalanced data > > set > > > ? > > > > > > > > > > > > Jeff Zhang > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Amogh Vasekar 2009-11-18, 11:54
Hi,
This is the time for all three phases of reducer right? I think its due to the constant spilling for a single key to disk since the map partitions couldn't be held in-mem due to buffer limit. Did the other reducer have numerous keys with low number of values ( ie smaller partitions? ) Thanks, Amogh On 11/18/09 3:37 AM, "Todd Lipcon" <[EMAIL PROTECTED]> wrote: On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > With respect to Imbalanced data, Can anyone guide me how sorting takes > place > in Hadoop after Map phase. > > I did some experiments and found that if there are two reducers which have > same number of keys to sort and one reducer has all the keys same and other > have different keys then time taken by by the reducer having all keys same > is terribly large then other one. > > Hi Pankil, This is an interesting experiment you've done with results that I wouldn't quite expect. Do you have the java source available that you used to run this experiment? > Also I found that length on my Key doesnt matter in the time taken to sort > it. > > With small keys on CPU-bound workload this is probably the case since the sort would be dominated by comparison. If you were to benchmark keys that are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference. > I wanted some hints how sorting is done .. > MapTask.java, ReduceTask.java, and Merger.java are the key places to look. The actual sort is a relatively basic quicksort, but there is plenty of complexity in the spill/shuffle/merge logic. -Todd > > Pankil > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > >wrote: > > > Hey Jeff, > > > > You may be interested in the Skewed Design specification from the Pig > team: > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > Regards, > > Jeff > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > wrote: > > > > > My first thought is that it depends on the reduce logic. If you could > do > > > the > > > reduction in two passes then you could do an initial arbitrary > partition > > > for > > > the majority key and bring the partitions together in a second > reduction > > > (or > > > a map-side join). I would use a round robin strategy to assign the > > > arbitrary > > > partitions. > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > > > > > Hi all, > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > dominates the map output, say 99%. So 99% data set will go to one > > > reducer, > > > > and this reducer will become the bottleneck. > > > > > > > > Does hadoop have any other better ways to handle such imbalanced data > > set > > > ? > > > > > > > > > > > > Jeff Zhang > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Pankil Doshi 2009-11-18, 19:16
Hey Todd,
I will attach dataset and java source used by me. Make sure you use with 10 reducers and also use partitioner class as I have provided. Dataset-1 has smaller key length Dataset-2 has larger key length When I experiment with both dataset , According my partitioner class Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it take maximum amount of time in all reducers.( like 17 mins) where as remaining all reducers also get 100000 keys but they all are not same , and these reducers get over in (1 min 30 sec on avg). Pankil On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > > > With respect to Imbalanced data, Can anyone guide me how sorting takes > > place > > in Hadoop after Map phase. > > > > I did some experiments and found that if there are two reducers which > have > > same number of keys to sort and one reducer has all the keys same and > other > > have different keys then time taken by by the reducer having all keys > same > > is terribly large then other one. > > > > > Hi Pankil, > > This is an interesting experiment you've done with results that I wouldn't > quite expect. Do you have the java source available that you used to run > this experiment? > > > > > Also I found that length on my Key doesnt matter in the time taken to > sort > > it. > > > > > With small keys on CPU-bound workload this is probably the case since the > sort would be dominated by comparison. If you were to benchmark keys that > are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference. > > > > I wanted some hints how sorting is done .. > > > > MapTask.java, ReduceTask.java, and Merger.java are the key places to look. > The actual sort is a relatively basic quicksort, but there is plenty of > complexity in the spill/shuffle/merge logic. > > -Todd > > > > > > > Pankil > > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > > >wrote: > > > > > Hey Jeff, > > > > > > You may be interested in the Skewed Design specification from the Pig > > team: > > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > > > Regards, > > > Jeff > > > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > > wrote: > > > > > > > My first thought is that it depends on the reduce logic. If you could > > do > > > > the > > > > reduction in two passes then you could do an initial arbitrary > > partition > > > > for > > > > the majority key and bring the partitions together in a second > > reduction > > > > (or > > > > a map-side join). I would use a round robin strategy to assign the > > > > arbitrary > > > > partitions. > > > > > > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> > wrote: > > > > > > > > > Hi all, > > > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > > dominates the map output, say 99%. So 99% data set will go to one > > > > reducer, > > > > > and this reducer will become the bottleneck. > > > > > > > > > > Does hadoop have any other better ways to handle such imbalanced > data > > > set > > > > ? > > > > > > > > > > > > > > > Jeff Zhang > > > > > > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Runping Qi 2009-11-18, 20:34
Is it true that most of the 17 minutes for the reducer with the 100000 same
keys was taken by the sort phase? If so, that means that the sorting algorithm does not handle the special case well. On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > Hey Todd, > > I will attach dataset and java source used by me. Make sure you use with 10 > reducers and also use partitioner class as I have provided. > > Dataset-1 has smaller key length > Dataset-2 has larger key length > > When I experiment with both dataset , According my partitioner class > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it take > maximum amount of time in all reducers.( like 17 mins) where as remaining > all reducers also get 100000 keys but they all are not same , and these > reducers get over in (1 min 30 sec on avg). > > Pankil > > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> >> wrote: >> >> > With respect to Imbalanced data, Can anyone guide me how sorting takes >> > place >> > in Hadoop after Map phase. >> > >> > I did some experiments and found that if there are two reducers which >> have >> > same number of keys to sort and one reducer has all the keys same and >> other >> > have different keys then time taken by by the reducer having all keys >> same >> > is terribly large then other one. >> > >> > >> Hi Pankil, >> >> This is an interesting experiment you've done with results that I wouldn't >> quite expect. Do you have the java source available that you used to run >> this experiment? >> >> >> >> > Also I found that length on my Key doesnt matter in the time taken to >> sort >> > it. >> > >> > >> With small keys on CPU-bound workload this is probably the case since the >> sort would be dominated by comparison. If you were to benchmark keys that >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference. >> >> >> > I wanted some hints how sorting is done .. >> > >> >> MapTask.java, ReduceTask.java, and Merger.java are the key places to look. >> The actual sort is a relatively basic quicksort, but there is plenty of >> complexity in the spill/shuffle/merge logic. >> >> -Todd >> >> >> >> > >> > Pankil >> > >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] >> > >wrote: >> > >> > > Hey Jeff, >> > > >> > > You may be interested in the Skewed Design specification from the Pig >> > team: >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec. >> > > >> > > Regards, >> > > Jeff >> > > >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> >> > wrote: >> > > >> > > > My first thought is that it depends on the reduce logic. If you >> could >> > do >> > > > the >> > > > reduction in two passes then you could do an initial arbitrary >> > partition >> > > > for >> > > > the majority key and bring the partitions together in a second >> > reduction >> > > > (or >> > > > a map-side join). I would use a round robin strategy to assign the >> > > > arbitrary >> > > > partitions. >> > > > >> > > > >> > > > >> > > > >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> >> wrote: >> > > > >> > > > > Hi all, >> > > > > >> > > > > Today there's a problem about imbalanced data come out of mind . >> > > > > >> > > > > I'd like to know how hadoop handle this kind of data. e.g. one >> key >> > > > > dominates the map output, say 99%. So 99% data set will go to one >> > > > reducer, >> > > > > and this reducer will become the bottleneck. >> > > > > >> > > > > Does hadoop have any other better ways to handle such imbalanced >> data >> > > set >> > > > ? >> > > > > >> > > > > >> > > > > Jeff Zhang >> > > > > >> > > > >> > > >> > >> > >
-
Re: How to handle imbalanced data in hadoop ?Pankil Doshi 2009-11-18, 22:53
Ya thats true. Though it depends on my cluster configuration but still other
reducers (0 to 8) also have 100000 keys to handle in which keys are different and they get done in (1 min 30 sec on avg). But 9th reducer getting all 100000 keys same takes 17 mins. Pankil On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <[EMAIL PROTECTED]> wrote: > Is it true that most of the 17 minutes for the reducer with the 100000 same > keys was taken by the sort phase? If so, that means that the sorting > algorithm does not handle the special case well. > > > On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <[EMAIL PROTECTED]> > wrote: > > > Hey Todd, > > > > I will attach dataset and java source used by me. Make sure you use with > 10 > > reducers and also use partitioner class as I have provided. > > > > Dataset-1 has smaller key length > > Dataset-2 has larger key length > > > > When I experiment with both dataset , According my partitioner class > > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it > take > > maximum amount of time in all reducers.( like 17 mins) where as remaining > > all reducers also get 100000 keys but they all are not same , and these > > reducers get over in (1 min 30 sec on avg). > > > > Pankil > > > > > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > > > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> > >> wrote: > >> > >> > With respect to Imbalanced data, Can anyone guide me how sorting takes > >> > place > >> > in Hadoop after Map phase. > >> > > >> > I did some experiments and found that if there are two reducers which > >> have > >> > same number of keys to sort and one reducer has all the keys same and > >> other > >> > have different keys then time taken by by the reducer having all keys > >> same > >> > is terribly large then other one. > >> > > >> > > >> Hi Pankil, > >> > >> This is an interesting experiment you've done with results that I > wouldn't > >> quite expect. Do you have the java source available that you used to run > >> this experiment? > >> > >> > >> > >> > Also I found that length on my Key doesnt matter in the time taken to > >> sort > >> > it. > >> > > >> > > >> With small keys on CPU-bound workload this is probably the case since > the > >> sort would be dominated by comparison. If you were to benchmark keys > that > >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a > difference. > >> > >> > >> > I wanted some hints how sorting is done .. > >> > > >> > >> MapTask.java, ReduceTask.java, and Merger.java are the key places to > look. > >> The actual sort is a relatively basic quicksort, but there is plenty of > >> complexity in the spill/shuffle/merge logic. > >> > >> -Todd > >> > >> > >> > >> > > >> > Pankil > >> > > >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher < > [EMAIL PROTECTED] > >> > >wrote: > >> > > >> > > Hey Jeff, > >> > > > >> > > You may be interested in the Skewed Design specification from the > Pig > >> > team: > >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > >> > > > >> > > Regards, > >> > > Jeff > >> > > > >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > >> > wrote: > >> > > > >> > > > My first thought is that it depends on the reduce logic. If you > >> could > >> > do > >> > > > the > >> > > > reduction in two passes then you could do an initial arbitrary > >> > partition > >> > > > for > >> > > > the majority key and bring the partitions together in a second > >> > reduction > >> > > > (or > >> > > > a map-side join). I would use a round robin strategy to assign the > >> > > > arbitrary > >> > > > partitions. > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> > >> wrote: > >> > > > > >> > > > > Hi all, > >> > > > > > >> > > > > Today there's a problem about imbalanced data come out of mind . > >> > > > > > >> > > > > I'd like to know how hadoop handle this kind of data. e.g. one
-
Re: How to handle imbalanced data in hadoop ?Todd Lipcon 2009-11-18, 22:55
Hi Pankil,
Thanks for sending these along. I'll try to block out some time this week to take a look. -Todd On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > Hey Todd, > > I will attach dataset and java source used by me. Make sure you use with 10 > reducers and also use partitioner class as I have provided. > > Dataset-1 has smaller key length > Dataset-2 has larger key length > > When I experiment with both dataset , According my partitioner class > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it take > maximum amount of time in all reducers.( like 17 mins) where as remaining > all reducers also get 100000 keys but they all are not same , and these > reducers get over in (1 min 30 sec on avg). > > Pankil > > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> >> wrote: >> >> > With respect to Imbalanced data, Can anyone guide me how sorting takes >> > place >> > in Hadoop after Map phase. >> > >> > I did some experiments and found that if there are two reducers which >> have >> > same number of keys to sort and one reducer has all the keys same and >> other >> > have different keys then time taken by by the reducer having all keys >> same >> > is terribly large then other one. >> > >> > >> Hi Pankil, >> >> This is an interesting experiment you've done with results that I wouldn't >> quite expect. Do you have the java source available that you used to run >> this experiment? >> >> >> >> > Also I found that length on my Key doesnt matter in the time taken to >> sort >> > it. >> > >> > >> With small keys on CPU-bound workload this is probably the case since the >> sort would be dominated by comparison. If you were to benchmark keys that >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a difference. >> >> >> > I wanted some hints how sorting is done .. >> > >> >> MapTask.java, ReduceTask.java, and Merger.java are the key places to look. >> The actual sort is a relatively basic quicksort, but there is plenty of >> complexity in the spill/shuffle/merge logic. >> >> -Todd >> >> >> >> > >> > Pankil >> > >> > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] >> > >wrote: >> > >> > > Hey Jeff, >> > > >> > > You may be interested in the Skewed Design specification from the Pig >> > team: >> > > http://wiki.apache.org/pig/PigSkewedJoinSpec. >> > > >> > > Regards, >> > > Jeff >> > > >> > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> >> > wrote: >> > > >> > > > My first thought is that it depends on the reduce logic. If you >> could >> > do >> > > > the >> > > > reduction in two passes then you could do an initial arbitrary >> > partition >> > > > for >> > > > the majority key and bring the partitions together in a second >> > reduction >> > > > (or >> > > > a map-side join). I would use a round robin strategy to assign the >> > > > arbitrary >> > > > partitions. >> > > > >> > > > >> > > > >> > > > >> > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> >> wrote: >> > > > >> > > > > Hi all, >> > > > > >> > > > > Today there's a problem about imbalanced data come out of mind . >> > > > > >> > > > > I'd like to know how hadoop handle this kind of data. e.g. one >> key >> > > > > dominates the map output, say 99%. So 99% data set will go to one >> > > > reducer, >> > > > > and this reducer will become the bottleneck. >> > > > > >> > > > > Does hadoop have any other better ways to handle such imbalanced >> data >> > > set >> > > > ? >> > > > > >> > > > > >> > > > > Jeff Zhang >> > > > > >> > > > >> > > >> > >> > >
-
Re: How to handle imbalanced data in hadoop ?Todd Lipcon 2009-11-24, 05:32
Interesting. I don't have the 17 minutes issue, but the reducer with the
identical keys is taking about twice as long as the others. Looking at counters, most of the tasks have "Reduce shuffle bytes" 0, whereas the slow one has reduce shuffle bytes 1,100,006 as expected. Logs on the slow one: 2009-11-23 21:20:48,524 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes 2009-11-23 21:22:53,763 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000009_0 is done. And is in the process of commiting On the fast one: 2009-11-23 21:20:45,370 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes 2009-11-23 21:21:27,535 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000008_0 is done. And is in the process of commiting So something about the merge is slow when all keys are identical, perhaps. Which is strange, because this isn't even much of a "merge" - there was only one mapper. I have a plane flight tonight, will try to take a deeper look. -Todd On Wed, Nov 18, 2009 at 2:53 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > Ya thats true. Though it depends on my cluster configuration but still > other > reducers (0 to 8) also have 100000 keys to handle in which keys are > different and they get done in (1 min 30 sec on avg). > But 9th reducer getting all 100000 keys same takes 17 mins. > > Pankil > > On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <[EMAIL PROTECTED]> wrote: > > > Is it true that most of the 17 minutes for the reducer with the 100000 > same > > keys was taken by the sort phase? If so, that means that the sorting > > algorithm does not handle the special case well. > > > > > > On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <[EMAIL PROTECTED]> > > wrote: > > > > > Hey Todd, > > > > > > I will attach dataset and java source used by me. Make sure you use > with > > 10 > > > reducers and also use partitioner class as I have provided. > > > > > > Dataset-1 has smaller key length > > > Dataset-2 has larger key length > > > > > > When I experiment with both dataset , According my partitioner class > > > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it > > take > > > maximum amount of time in all reducers.( like 17 mins) where as > remaining > > > all reducers also get 100000 keys but they all are not same , and these > > > reducers get over in (1 min 30 sec on avg). > > > > > > Pankil > > > > > > > > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> > wrote: > > > > > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> > > >> wrote: > > >> > > >> > With respect to Imbalanced data, Can anyone guide me how sorting > takes > > >> > place > > >> > in Hadoop after Map phase. > > >> > > > >> > I did some experiments and found that if there are two reducers > which > > >> have > > >> > same number of keys to sort and one reducer has all the keys same > and > > >> other > > >> > have different keys then time taken by by the reducer having all > keys > > >> same > > >> > is terribly large then other one. > > >> > > > >> > > > >> Hi Pankil, > > >> > > >> This is an interesting experiment you've done with results that I > > wouldn't > > >> quite expect. Do you have the java source available that you used to > run > > >> this experiment? > > >> > > >> > > >> > > >> > Also I found that length on my Key doesnt matter in the time taken > to > > >> sort > > >> > it. > > >> > > > >> > > > >> With small keys on CPU-bound workload this is probably the case since > > the > > >> sort would be dominated by comparison. If you were to benchmark keys > > that > > >> are 10 bytes vs keys that are 1000 bytes, I'm sure you'd see a > > difference. > > >> > > >> > > >> > I wanted some hints how sorting is done .. > > >> > > > >> > > >> MapTask.java, ReduceTask.java, and Merger.java are the key places to > > look.
-
Re: How to handle imbalanced data in hadoop ?Todd Lipcon 2009-11-24, 07:14
Oops - I mistakenly assumed the test Reducer was just some kind of
wordcount-esque summer. In fact, it has an O(n^2) operation, essentially: sValue += values.next().toString() + '\t'; Appending to a string like this is very slow, and explains why the reducers that get a lot of the same key are way slower. It's allocating lots of excess objects, and doing a ton of memory copies. Switching to a stringbuffer fixed the whole thing: - String sValue=""; + StringBuffer sb = new StringBuffer(); while (values.hasNext()) { iCount++; - sValue += values.next().toString() + '\t'; + sb.append(values.next().toString()).append('\t'); } Hope that helps, Pankil. -Todd On Mon, Nov 23, 2009 at 9:32 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > Interesting. I don't have the 17 minutes issue, but the reducer with the > identical keys is taking about twice as long as the others. > > Looking at counters, most of the tasks have "Reduce shuffle bytes" 0, > whereas the slow one has reduce shuffle bytes 1,100,006 as expected. > > Logs on the slow one: > > 2009-11-23 21:20:48,524 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes > 2009-11-23 21:22:53,763 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000009_0 is done. And is in the process of commiting > > > On the fast one: > > 2009-11-23 21:20:45,370 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 1100002 bytes > 2009-11-23 21:21:27,535 INFO org.apache.hadoop.mapred.TaskRunner: Task:attempt_200911232110_0007_r_000008_0 is done. And is in the process of commiting > > > So something about the merge is slow when all keys are identical, perhaps. > Which is strange, because this isn't even much of a "merge" - there was only > one mapper. > > I have a plane flight tonight, will try to take a deeper look. > > -Todd > > > > On Wed, Nov 18, 2009 at 2:53 PM, Pankil Doshi <[EMAIL PROTECTED]> wrote: > >> Ya thats true. Though it depends on my cluster configuration but still >> other >> reducers (0 to 8) also have 100000 keys to handle in which keys are >> different and they get done in (1 min 30 sec on avg). >> But 9th reducer getting all 100000 keys same takes 17 mins. >> >> Pankil >> >> On Wed, Nov 18, 2009 at 3:34 PM, Runping Qi <[EMAIL PROTECTED]> wrote: >> >> > Is it true that most of the 17 minutes for the reducer with the 100000 >> same >> > keys was taken by the sort phase? If so, that means that the sorting >> > algorithm does not handle the special case well. >> > >> > >> > On Wed, Nov 18, 2009 at 11:16 AM, Pankil Doshi <[EMAIL PROTECTED]> >> > wrote: >> > >> > > Hey Todd, >> > > >> > > I will attach dataset and java source used by me. Make sure you use >> with >> > 10 >> > > reducers and also use partitioner class as I have provided. >> > > >> > > Dataset-1 has smaller key length >> > > Dataset-2 has larger key length >> > > >> > > When I experiment with both dataset , According my partitioner class >> > > Reducer 9 (i.e 10 if start with 1) gets all 100000 keys same and so it >> > take >> > > maximum amount of time in all reducers.( like 17 mins) where as >> remaining >> > > all reducers also get 100000 keys but they all are not same , and >> these >> > > reducers get over in (1 min 30 sec on avg). >> > > >> > > Pankil >> > > >> > > >> > > On Tue, Nov 17, 2009 at 5:07 PM, Todd Lipcon <[EMAIL PROTECTED]> >> wrote: >> > > >> > >> On Tue, Nov 17, 2009 at 1:54 PM, Pankil Doshi <[EMAIL PROTECTED]> >> > >> wrote: >> > >> >> > >> > With respect to Imbalanced data, Can anyone guide me how sorting >> takes >> > >> > place >> > >> > in Hadoop after Map phase. >> > >> > >> > >> > I did some experiments and found that if there are two reducers >> which >> > >> have >> > >> > same number of keys to sort and one reducer has all the keys same
-
Re: How to handle imbalanced data in hadoop ?Ted Xu 2009-11-24, 15:35
2009/11/18 Pankil Doshi <[EMAIL PROTECTED]>
> With respect to Imbalanced data, Can anyone guide me how sorting takes > place > in Hadoop after Map phase. > > I did some experiments and found that if there are two reducers which have > same number of keys to sort and one reducer has all the keys same and other > have different keys then time taken by by the reducer having all keys same > is terribly large then other one. > Maybe that is caused by the sorting algorithm. It is make sence if we consider the default sort algorithm is quick sort. In quick sort if all the sorting data is quite the same, sort iteration will be very deep. Would you please try changing the job configuration "map.sort.class" to be org.apache.hadoop.util.HeapSort? > Also I found that length on my Key doesnt matter in the time taken to sort > it. > > I wanted some hints how sorting is done .. > > Pankil > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > >wrote: > > > Hey Jeff, > > > > You may be interested in the Skewed Design specification from the Pig > team: > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > Regards, > > Jeff > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > wrote: > > > > > My first thought is that it depends on the reduce logic. If you could > do > > > the > > > reduction in two passes then you could do an initial arbitrary > partition > > > for > > > the majority key and bring the partitions together in a second > reduction > > > (or > > > a map-side join). I would use a round robin strategy to assign the > > > arbitrary > > > partitions. > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > > > > > Hi all, > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > dominates the map output, say 99%. So 99% data set will go to one > > > reducer, > > > > and this reducer will become the bottleneck. > > > > > > > > Does hadoop have any other better ways to handle such imbalanced data > > set > > > ? > > > > > > > > > > > > Jeff Zhang > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Jeff Zhang 2009-11-24, 16:09
agree, the worst case for quick sort is O(n2) if all the data is the same.
and in this case the running for heapsort is O(n*log n) Jeff Zhang On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <[EMAIL PROTECTED]> wrote: > 2009/11/18 Pankil Doshi <[EMAIL PROTECTED]> > > > With respect to Imbalanced data, Can anyone guide me how sorting takes > > place > > in Hadoop after Map phase. > > > > I did some experiments and found that if there are two reducers which > have > > same number of keys to sort and one reducer has all the keys same and > other > > have different keys then time taken by by the reducer having all keys > same > > is terribly large then other one. > > > > Maybe that is caused by the sorting algorithm. > > It is make sence if we consider the default sort algorithm is quick sort. > In quick sort if > all the sorting data is quite the same, sort iteration will be very deep. > > Would you please try changing the job configuration "map.sort.class" to be > org.apache.hadoop.util.HeapSort? > > > > Also I found that length on my Key doesnt matter in the time taken to > sort > > it. > > > > I wanted some hints how sorting is done .. > > > > Pankil > > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher <[EMAIL PROTECTED] > > >wrote: > > > > > Hey Jeff, > > > > > > You may be interested in the Skewed Design specification from the Pig > > team: > > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > > > Regards, > > > Jeff > > > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > > wrote: > > > > > > > My first thought is that it depends on the reduce logic. If you could > > do > > > > the > > > > reduction in two passes then you could do an initial arbitrary > > partition > > > > for > > > > the majority key and bring the partitions together in a second > > reduction > > > > (or > > > > a map-side join). I would use a round robin strategy to assign the > > > > arbitrary > > > > partitions. > > > > > > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> > wrote: > > > > > > > > > Hi all, > > > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one key > > > > > dominates the map output, say 99%. So 99% data set will go to one > > > > reducer, > > > > > and this reducer will become the bottleneck. > > > > > > > > > > Does hadoop have any other better ways to handle such imbalanced > data > > > set > > > > ? > > > > > > > > > > > > > > > Jeff Zhang > > > > > > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Todd Lipcon 2009-11-24, 16:18
I don't disagree, but as I noted above, the majority of the issue Pankil saw
was an inefficient reducer that was extremely slow for large value sets for a given key. When I fixed that issue, the sort time was within the noise margin between the "all the same" and "9 different values" reducers. -Todd On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > agree, the worst case for quick sort is O(n2) if all the data is the same. > > and in this case the running for heapsort is O(n*log n) > > Jeff Zhang > > > On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <[EMAIL PROTECTED]> wrote: > > > 2009/11/18 Pankil Doshi <[EMAIL PROTECTED]> > > > > > With respect to Imbalanced data, Can anyone guide me how sorting takes > > > place > > > in Hadoop after Map phase. > > > > > > I did some experiments and found that if there are two reducers which > > have > > > same number of keys to sort and one reducer has all the keys same and > > other > > > have different keys then time taken by by the reducer having all keys > > same > > > is terribly large then other one. > > > > > > > Maybe that is caused by the sorting algorithm. > > > > It is make sence if we consider the default sort algorithm is quick sort. > > In quick sort if > > all the sorting data is quite the same, sort iteration will be very deep. > > > > Would you please try changing the job configuration "map.sort.class" to > be > > org.apache.hadoop.util.HeapSort? > > > > > > > Also I found that length on my Key doesnt matter in the time taken to > > sort > > > it. > > > > > > I wanted some hints how sorting is done .. > > > > > > Pankil > > > > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher < > [EMAIL PROTECTED] > > > >wrote: > > > > > > > Hey Jeff, > > > > > > > > You may be interested in the Skewed Design specification from the Pig > > > team: > > > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > > > > > Regards, > > > > Jeff > > > > > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED]> > > > wrote: > > > > > > > > > My first thought is that it depends on the reduce logic. If you > could > > > do > > > > > the > > > > > reduction in two passes then you could do an initial arbitrary > > > partition > > > > > for > > > > > the majority key and bring the partitions together in a second > > > reduction > > > > > (or > > > > > a map-side join). I would use a round robin strategy to assign the > > > > > arbitrary > > > > > partitions. > > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> > > wrote: > > > > > > > > > > > Hi all, > > > > > > > > > > > > Today there's a problem about imbalanced data come out of mind . > > > > > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one > key > > > > > > dominates the map output, say 99%. So 99% data set will go to one > > > > > reducer, > > > > > > and this reducer will become the bottleneck. > > > > > > > > > > > > Does hadoop have any other better ways to handle such imbalanced > > data > > > > set > > > > > ? > > > > > > > > > > > > > > > > > > Jeff Zhang > > > > > > > > > > > > > > > > > > > > >
-
Re: How to handle imbalanced data in hadoop ?Pankil Doshi 2009-11-24, 18:35
Thanks a lot Todd.
Its very good suggestion. Will change my reducer code and see how does it goes. Pankil On Tue, Nov 24, 2009 at 11:18 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote: > I don't disagree, but as I noted above, the majority of the issue Pankil > saw > was an inefficient reducer that was extremely slow for large value sets for > a given key. When I fixed that issue, the sort time was within the noise > margin between the "all the same" and "9 different values" reducers. > > -Todd > > On Tue, Nov 24, 2009 at 8:09 AM, Jeff Zhang <[EMAIL PROTECTED]> wrote: > > > agree, the worst case for quick sort is O(n2) if all the data is the > same. > > > > and in this case the running for heapsort is O(n*log n) > > > > Jeff Zhang > > > > > > On Tue, Nov 24, 2009 at 7:35 AM, Ted Xu <[EMAIL PROTECTED]> wrote: > > > > > 2009/11/18 Pankil Doshi <[EMAIL PROTECTED]> > > > > > > > With respect to Imbalanced data, Can anyone guide me how sorting > takes > > > > place > > > > in Hadoop after Map phase. > > > > > > > > I did some experiments and found that if there are two reducers which > > > have > > > > same number of keys to sort and one reducer has all the keys same and > > > other > > > > have different keys then time taken by by the reducer having all keys > > > same > > > > is terribly large then other one. > > > > > > > > > > Maybe that is caused by the sorting algorithm. > > > > > > It is make sence if we consider the default sort algorithm is quick > sort. > > > In quick sort if > > > all the sorting data is quite the same, sort iteration will be very > deep. > > > > > > Would you please try changing the job configuration "map.sort.class" to > > be > > > org.apache.hadoop.util.HeapSort? > > > > > > > > > > Also I found that length on my Key doesnt matter in the time taken to > > > sort > > > > it. > > > > > > > > I wanted some hints how sorting is done .. > > > > > > > > Pankil > > > > > > > > On Sun, Nov 15, 2009 at 7:25 PM, Jeff Hammerbacher < > > [EMAIL PROTECTED] > > > > >wrote: > > > > > > > > > Hey Jeff, > > > > > > > > > > You may be interested in the Skewed Design specification from the > Pig > > > > team: > > > > > http://wiki.apache.org/pig/PigSkewedJoinSpec. > > > > > > > > > > Regards, > > > > > Jeff > > > > > > > > > > On Sun, Nov 15, 2009 at 2:00 PM, brien colwell <[EMAIL PROTECTED] > > > > > > wrote: > > > > > > > > > > > My first thought is that it depends on the reduce logic. If you > > could > > > > do > > > > > > the > > > > > > reduction in two passes then you could do an initial arbitrary > > > > partition > > > > > > for > > > > > > the majority key and bring the partitions together in a second > > > > reduction > > > > > > (or > > > > > > a map-side join). I would use a round robin strategy to assign > the > > > > > > arbitrary > > > > > > partitions. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Sat, Nov 14, 2009 at 11:03 PM, Jeff Zhang <[EMAIL PROTECTED]> > > > wrote: > > > > > > > > > > > > > Hi all, > > > > > > > > > > > > > > Today there's a problem about imbalanced data come out of mind > . > > > > > > > > > > > > > > I'd like to know how hadoop handle this kind of data. e.g. one > > key > > > > > > > dominates the map output, say 99%. So 99% data set will go to > one > > > > > > reducer, > > > > > > > and this reducer will become the bottleneck. > > > > > > > > > > > > > > Does hadoop have any other better ways to handle such > imbalanced > > > data > > > > > set > > > > > > ? > > > > > > > > > > > > > > > > > > > > > Jeff Zhang > > > > > > > > > > > > > > > > > > > > > > > > > > > > |