-
Scheduler recommendation
Bobby Dennett 2010-08-11, 22:05
Hi all,
From what I've read/seen, it appears that, if not the "default" scheduler, most installations are using Hadoop's Fair Scheduler. Based on features and our requirements, we're leaning towards using the Capacity Scheduler; however, there is some concern that it may not be as "stable" as there doesn't appear to be as much talk about it, compared to the Fair Scheduler.
Has anyone hit any nasty issues with regards to the Capacity Scheduler and, in general, are there any "gotchas" to look out for with either scheduler?
We're ramping up the number of users on our Hadoop clusters, particularly in regards to Hive. Our goal is to ensure that production processes continue to run with a majority of the cluster during peak usage times, while personal users share the remaining capacity. The Capacity Scheduler's support of queues and for memory-intensive jobs is appealing but we are curious about drawbacks and/or potential issues.
Thanks in advance, -Bobby
-
Re: Scheduler recommendation
Allen Wittenauer 2010-08-11, 22:38
On Aug 11, 2010, at 3:05 PM, Bobby Dennett wrote:
> Has anyone hit any nasty issues with regards to the Capacity Scheduler
> and, in general, are there any "gotchas" to look out for with either
> scheduler?
We've been running Capacity for about 8 months now on all of our grids. In order to make it usable, we did apply two (all server-side) patches to Apache 0.20.2:
MAPREDUCE-1105
MAPREDUCE-1160
> We're ramping up the number of users on our Hadoop clusters,
> particularly in regards to Hive. Our goal is to ensure that production
> processes continue to run with a majority of the cluster during peak
> usage times, while personal users share the remaining capacity. The
> Capacity Scheduler's support of queues and for memory-intensive jobs
> is appealing but we are curious about drawbacks and/or potential
> issues.
Our experiences with Fair Share (as shipped stock w/0.20.0) were pretty horrific in this type of environment. So we went to Capacity w/the above (relatively simple) patches for the same reasons, although we don't have the memory management features, since that was a more invasive backport from 0.21.
We have two queues: default and marathon. Marathon is limited to 65-75% of the grid (depending upon which grid) and is used for long running or massive jobs. The default queue is unlimited and is used primarily by short/ad hoc jobs. Other than misbehaving/ill-informed users submitting jobs to default that should go to marathon, it has been working fairly well for us.
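As a rough illustration of what those limits mean in task slots, here is a minimal Python sketch. The function name and the 400-slot grid size are made up for the example; only the 65-75% figures and the unlimited default queue come from the setup described above. It is plain arithmetic, not the scheduler's own code.

```python
def queue_slot_limit(total_slots, max_percent=None):
    """Most slots a queue may occupy at once.

    max_percent=None models an unlimited queue (like "default" above),
    which may grow to the whole grid when nothing else is running.
    """
    if max_percent is None:
        return total_slots
    if not 0 <= max_percent <= 100:
        raise ValueError("max_percent must be between 0 and 100")
    # Integer arithmetic, rounding down: a queue never exceeds its cap.
    return total_slots * max_percent // 100


if __name__ == "__main__":
    grid_slots = 400  # hypothetical grid size, for illustration only
    for name, cap in (("default", None), ("marathon", 65), ("marathon", 75)):
        print(name, cap, queue_slot_limit(grid_slots, cap))
```

On such a grid, the marathon cap leaves at least a quarter to a third of the slots that long-running jobs can never occupy, which is what keeps short/ad hoc jobs moving.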
One of the key things that is worth mentioning is that you will want to pay attention to your slowstart setting. The default is *way* too low for a shared grid. We were running with .55 up until today with great results. We're currently trying out .80 due to an influx of huge, poorly tuned/written jobs that don't produce that much intermediate output but have long-running reduces.
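To make the slowstart numbers concrete: the setting is the fraction of a job's maps that must finish before its reduces may be scheduled (in 0.20 the property is mapred.reduce.slowstart.completed.maps, with a stock default of 0.05). The Python sketch below just does that arithmetic. The helper name and the 1000-map job are hypothetical, and rounding up to a whole map is an assumption of the sketch rather than something stated in this thread.

```python
import math


def maps_before_reduces(total_maps, slowstart):
    """Completed maps required before a job's reduces may be scheduled."""
    if not 0.0 <= slowstart <= 1.0:
        raise ValueError("slowstart must be between 0 and 1")
    # round() guards against float noise (e.g. 50.000000001) before the
    # ceiling; rounding up to a whole map is an assumption of this sketch.
    return math.ceil(round(total_maps * slowstart, 9))


if __name__ == "__main__":
    total_maps = 1000  # hypothetical job size
    for fraction in (0.05, 0.55, 0.80):
        print(fraction, maps_before_reduces(total_maps, fraction))
```

With a low value, reduces grab slots early and then sit mostly idle waiting for map output; on a shared grid those are slots other jobs could have used, which is why a higher value helps.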
The only big con has been the 'unique snowflake' syndrome amongst some of the users/depts when their jobs aren't given a job slot immediately. But that is more political than practical. Total cluster throughput is a lot higher since no pre-emption == tasks run to completion rather than having to re-do the work that was already done.
-
Re: Scheduler recommendation
Edward Capriolo 2010-08-11, 22:47
Sorry, I cannot speak for the Capacity Scheduler. We use Fair Share, and I just modified the configuration, so I figured I would chime in.
We have the same use case as you: production map/reduce jobs, production Hive jobs, as well as ad hoc Hive jobs.
We broke our jobs into two classes: those that cannot be preempted and those that can be preempted.
Call them production and adhoc.
In our Hive conf we set pool.name to adhoc. This way, Hive jobs enter the adhoc pool by default. If a Hive job is a production job, we run: set pool.name=production; followed by the HQL.
We could theoretically create more adhoc pools, but for our use case we only care that our production jobs are never preempted and that they have a pool of resources dedicated to them.
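The default-plus-override behaviour can be modelled in a few lines of Python. This is only a sketch of the idea, not Hive or Fair Scheduler code: the dictionary and function names are hypothetical, and the only names taken from the setup above are pool.name, adhoc and production.

```python
# Site-wide Hive defaults: every session starts in the adhoc pool.
HIVE_SITE_DEFAULTS = {"pool.name": "adhoc"}


def effective_conf(session_overrides=None):
    """Merge per-session 'set key=value;' overrides over the site defaults."""
    conf = dict(HIVE_SITE_DEFAULTS)
    conf.update(session_overrides or {})
    return conf


def pool_for(conf):
    """Pool a job submitted with this configuration would land in."""
    return conf["pool.name"]


if __name__ == "__main__":
    # An ad hoc user sets nothing, so the job goes to adhoc.
    print(pool_for(effective_conf()))
    # A production script starts with: set pool.name=production;
    print(pool_for(effective_conf({"pool.name": "production"})))
```

The point of the design is that the safe choice is the default: a user has to opt in explicitly to reach the pool that cannot be preempted.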
Regards, Edward
-
Re: Scheduler recommendation
Hemanth Yamijala 2010-08-12, 05:01
Hi,
FWIW, Yahoo! has been running the Capacity Scheduler for a reasonably long time now. However, there have been many patches to the Capacity Scheduler on top of the base Hadoop 0.20.2 version that make it 'stable' and able to work effectively at large scale. Looking at the change log of the Yahoo Hadoop distribution could give an idea of which patches are useful to pick up and apply to an older version. The good news is that most of these patches have 0.20 versions that are available on JIRA and would apply reasonably cleanly.
-
Re: Scheduler recommendation
Hemanth Yamijala 2010-08-12, 05:30
Hi,
Allen cautions that the part about patches applying cleanly to 0.20 might not be very true. Thanks for that heads-up, Allen!
Thanks,
Hemanth
-
Re: Scheduler recommendation
patek tek 2010-08-12, 12:55
I have been trying to post to the mailing list without success since yesterday. This is yet another attempt.