-
Regarding FIFO scheduler
Praveen Sripati 2011-09-22, 13:05
Hi,
Let's assume there are two jobs, J1 (100 map tasks) and J2 (200 map tasks), the cluster has a capacity of 150 map tasks (15 nodes with 10 map tasks per node), and Hadoop is using the default FIFO scheduler. If I submit J1 first and then J2, will the jobs run in parallel, or does J1 have to complete before J2 starts?
I was reading 'Hadoop: The Definitive Guide' and it says "Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they ran in order of submission, using a FIFO scheduler. Typically, each job would use the whole cluster, so jobs had to wait their turn."
Thanks, Praveen
-
Re: Regarding FIFO scheduler
Joey Echeverria 2011-09-22, 13:23
The jobs would run in parallel since J1 doesn't use all of your map slots. Things get more interesting with reduce slots. If J1 is an overall slower job, and you haven't configured mapred.reduce.slowstart.completed.maps, then J1 could launch a bunch of reduce tasks that sit idle in the reduce slots, which would starve J2.
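As a rough illustration of the map-slot arithmetic for the numbers in the question, here is a minimal Python sketch. It assumes the scheduler simply hands free map slots to jobs in submission order and ignores data locality and heartbeat timing; the function name fifo_initial_map_slots is hypothetical, not part of Hadoop.

```python
def fifo_initial_map_slots(capacity, jobs):
    """Hand out free map slots to jobs in submission order (simplified model)."""
    allocation = {}
    free = capacity
    for name, map_tasks in jobs:
        # A job can take at most as many slots as it has map tasks.
        granted = min(map_tasks, free)
        allocation[name] = granted
        free -= granted
    return allocation


# Numbers from the question: 150 map slots, J1 submitted before J2.
jobs = [("J1", 100), ("J2", 200)]
allocation = fifo_initial_map_slots(150, jobs)
print(allocation)  # {'J1': 100, 'J2': 50}
```

Under this simplified model J2 starts with the 50 slots J1 leaves free, and picks up more as J1's map tasks finish.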
In general, it's best to configure the slow start property and to use the fair scheduler or capacity scheduler.
-Joey
-- Joseph Echeverria Cloudera, Inc. 443.305.9434
-
Re: Regarding FIFO scheduler
Praveen Sripati 2011-09-22, 13:38
Joey,
Thanks for the response.
'mapreduce.job.reduce.slowstart.completedmaps' defaults to 0.05, and its description says 'Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.'
Shouldn't all the map tasks be completed before the reduce tasks are kicked off for a particular job?
Praveen
-
Re: Regarding FIFO scheduler
Joey Echeverria 2011-09-22, 13:43
In most cases, your job will have more map tasks than map slots. You want the reducers to spin up at some point before all your maps complete, so that the shuffle and sort can work in parallel with some of your map tasks. I usually set slow start to 80%, sometimes higher if I know the maps are slow and they do a lot of filtering, so there isn't too much intermediate data.
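To make the slow start fraction concrete, the sketch below works out how many maps must finish before a job's reducers are scheduled. It is an approximation: it assumes the threshold is the fraction times the number of map tasks, rounded up, and the helper name maps_before_reduces is hypothetical.

```python
import math
from fractions import Fraction


def maps_before_reduces(total_maps, slowstart):
    """Completed maps needed before reduces are scheduled.

    Assumption: threshold = ceil(slowstart * total_maps).
    Fraction avoids floating-point rounding surprises in the product.
    """
    return math.ceil(Fraction(str(slowstart)) * total_maps)


default_j1 = maps_before_reduces(100, 0.05)  # default fraction, J1
tuned_j1 = maps_before_reduces(100, 0.80)    # slow start at 80%, J1
tuned_j2 = maps_before_reduces(200, 0.80)    # slow start at 80%, J2
print(default_j1, tuned_j1, tuned_j2)  # 5 80 160
```

With the default of 0.05, J1's reducers could claim reduce slots after only 5 of its 100 maps finish; at 0.80 they wait for 80.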
-Joey
-
Re: Regarding FIFO scheduler
Praveen Sripati 2011-09-22, 13:51
Thanks, got the point. So, the shuffle and sort can happen in parallel even before all the map tasks are completed, but the reduce happens only after all the map tasks are complete.
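That summary can be written down as a small state function. This is only a simplified model of a job's reducers, with hypothetical names (reducer_state, slowstart_maps) and three coarse states; real reducers also merge map outputs in the background while copying.

```python
def reducer_state(completed_maps, total_maps, slowstart_maps):
    """Coarse state of a job's reducers given map progress (simplified model)."""
    if completed_maps < slowstart_maps:
        # Slow start threshold not reached: reducers are not scheduled yet.
        return "not scheduled"
    if completed_maps < total_maps:
        # Reducers hold slots and copy finished map outputs while maps still run.
        return "shuffling"
    # All map outputs exist, so the reduce function can run after the final merge.
    return "ready to reduce"


# J1 with 100 maps and slow start at 80 completed maps.
states = [reducer_state(done, 100, 80) for done in (50, 80, 99, 100)]
print(states)  # ['not scheduled', 'shuffling', 'shuffling', 'ready to reduce']
```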
Praveen