-
When does Reduce job start
sagar naik 2011-01-04, 18:53
Hi All,
Number of map tasks: 1000s
Number of reduce tasks: single digit
In such cases the reduce tasks won't start even when a few map tasks have completed. Example: in a sample run of bin/hadoop jar hadoop-*examples*.jar pi 10000 10, I observed that the reduce did not start until 90% of the map tasks were complete.
The only reason I can think of for not starting a reduce task earlier is to avoid unnecessary transfer of map output data in case of failures. Is there a way to start the reduce tasks sooner in such a case? What is the configuration parameter to change this behavior?
Thanks, Sagar
-
Re: When does Reduce job start
Allen Wittenauer 2011-01-04, 20:16
On Jan 4, 2011, at 10:53 AM, sagar naik wrote:
> The only reason, I can think of not starting a reduce task is to
> avoid the un-necessary transfer of map output data in case of
> failures.

Reduce tasks also eat slots while doing the map output. On shared grids, this can be extremely bad behavior.

> Is there a way to quickly start the reduce task in such case ?
> What is the configuration param to change this behavior

mapred.reduce.slowstart.completed.maps

See http://wiki.apache.org/hadoop/LimitingTaskSlotUsage (from the FAQ 2.12/2.13 questions).
-
Re: When does Reduce job start
Jeff Bean 2011-01-04, 23:14
It's part of the design that reduce() does not get called until the map phase is complete. You're seeing reduce report as started when map is at 90% complete because Hadoop is shuffling data from the mappers that have completed. As currently designed, you can't prematurely start reduce() because there is no way to guarantee you have all the values for any key until all the mappers are done. reduce() requires a key and all the values for that key in order to execute.
Jeff
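The point about reduce() needing every value for a key can be illustrated with a toy model. This is only a sketch in plain Python, not Hadoop code: `shuffle`, `reduce_sum`, and the sample `map_outputs` are hypothetical names and data made up for the illustration.

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Group (key, value) pairs from the given map outputs by key."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for key, value in output:
            grouped[key].append(value)
    return grouped

def reduce_sum(key, values):
    """A toy reduce(): sum all values seen for one key."""
    return key, sum(values)

# Three hypothetical mappers; the third one finishes last.
map_outputs = [
    [("a", 1), ("b", 1)],
    [("a", 1)],
    [("b", 1), ("a", 1)],
]

# Running reduce after only two mappers gives incomplete totals.
early = dict(reduce_sum(k, v) for k, v in shuffle(map_outputs[:2]).items())
# Running it after all mappers gives the complete totals.
final = dict(reduce_sum(k, v) for k, v in shuffle(map_outputs).items())
```

In this model the shuffle (copying and grouping) can begin as soon as any mapper finishes, but the totals are only correct once the last mapper's output has been merged in, which matches the distinction drawn above between a reduce task starting and reduce() being called.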
-
Re: When does Reduce job start
sagar naik 2011-01-05, 01:14
Hi Jeff,
To be clear, I am not talking about the reduce() function call but about the spawning of the reduce process/task itself. To rephrase: the reduce process/task is not started until 90% of the map tasks are done.
-Sagar
-
Re: When does Reduce job start
James Seigel 2011-01-05, 01:18
As the other gentleman said. The reduce task kinda needs to know all the data is available before doing its work.
By design.
Cheers James
Sent from my mobile. Please excuse the typos.
-
Re: When does Reduce job start
Harsh J 2011-01-05, 03:23
Hello Sagar,
On Wed, Jan 5, 2011 at 6:44 AM, sagar naik <[EMAIL PROTECTED]> wrote:
> What is the configuration param to change this behavior
mapred.reduce.slowstart.completed.maps is a property (0.20.x) that controls "when" the ReduceTasks have to start getting scheduled. Your job would still need free reduce slots for it to begin.
-- Harsh J www.harshj.com
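A rough model of what this property controls, as a sketch only: it assumes the scheduler compares the number of finished maps against the ceiling of (fraction × total maps), which is how the 0.20.x behavior is usually described. The function names below are hypothetical and are not part of any Hadoop API, and free reduce slots are not modelled.

```python
import math

def maps_needed_before_reduces(num_maps: int, slowstart_fraction: float) -> int:
    """Completed maps required before reduce tasks become eligible for scheduling.

    Assumption: threshold = ceil(slowstart_fraction * num_maps).
    """
    if not 0.0 <= slowstart_fraction <= 1.0:
        raise ValueError("slowstart_fraction must be between 0.0 and 1.0")
    return math.ceil(slowstart_fraction * num_maps)

def reduces_may_be_scheduled(completed_maps: int, num_maps: int,
                             slowstart_fraction: float) -> bool:
    """True once enough maps have finished; free reduce slots are still needed."""
    return completed_maps >= maps_needed_before_reduces(num_maps, slowstart_fraction)
```

Under this model, a fraction of 0.0 makes reduces eligible immediately and 1.0 holds them back until every map has finished. Lowering the value trades earlier shuffling for reduce slots that sit occupied while the maps are still running.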
-
Re: When does Reduce job start
sagar naik 2011-01-05, 06:40
That is what I was looking for. Thanks a million, Harsh. Cool, now that I have a starting point, I will check it in Hadoop 0.18.
-Sagar