Sorry, I cannot speak for the Capacity Scheduler. We use the Fair Scheduler and
I just modified our configuration, so I figured I would chime in.
We have the same use case as you: production map/reduce jobs, production Hive
jobs, and ad hoc Hive jobs.
We broke our jobs into two classes:
those that cannot be preempted
those that can be preempted.
Call them production and adhoc.
In our Hive conf we set pool.name to adhoc, so by default Hive jobs enter the
adhoc pool. If a Hive job is a production job, we put this in front of the HQL:
set pool.name=production;
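
A minimal sketch of what this setup might look like in configuration. The
hive-site.xml entry mirrors the pool.name default described above. The
mapred-site.xml entry is an assumption: it relies on the Fair Scheduler's
mapred.fairscheduler.poolnameproperty setting to tell the JobTracker which
job property names the pool, and it should be checked against the
documentation for your Hadoop version.

```xml
<!-- hive-site.xml: Hive jobs land in the adhoc pool unless overridden -->
<property>
  <name>pool.name</name>
  <value>adhoc</value>
</property>

<!-- mapred-site.xml (assumed): have the Fair Scheduler read the pool
     from the pool.name job property -->
<property>
  <name>mapred.fairscheduler.poolnameproperty</name>
  <value>pool.name</value>
</property>
```

With a default like this in place, a production script only needs the
one-line override before its queries, and everything else stays in adhoc.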
We could theoretically create more adhoc pools, but for our use case we only
care that our production jobs are never preempted and they have a pool of
On Aug 11, 2010 6:05 PM, "Bobby Dennett" <bdennett+[EMAIL PROTECTED]> wrote:
From what I've read/seen, it appears that, if not the "default"
scheduler, most installations are using Hadoop's Fair Scheduler. Based
on features and our requirements, we're leaning towards using the
Capacity Scheduler; however, there is some concern that it may not be
as "stable" as there doesn't appear to be as much talk about it,
compared to the Fair Scheduler.
Has anyone hit any nasty issues with regard to the Capacity Scheduler
and, in general, are there any "gotchas" to look out for with either
We're ramping up the number of users on our Hadoop clusters,
particularly with regard to Hive. Our goal is to ensure that production
processes continue to run with a majority of the cluster during peak
usage times, while personal users share the remaining capacity. The
Capacity Scheduler's support of queues and for memory-intensive jobs
is appealing but we are curious about drawbacks and/or potential
Thanks in advance,