If you're looking to set a fixed # of maps per job and also control
how many of them run in parallel on each node, a Scheduler cannot
solve that for you, though it may assist in the process.

Setting a specific # of maps in a job to match something is certainly
not a Scheduler's work, as it only deals with what task needs to go
where. To control your job's # of maps (i.e. input splits), tweak
your Job's InputFormat#getSplits(…). The number of splits it returns
dictates the total number of maps your job ends up running.
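
To illustrate the idea, here is a minimal sketch of the arithmetic a
custom getSplits(…) could use to carve an input of a known size into
a fixed number of pieces. It deliberately does not use the Hadoop API:
SplitPlanner, Range and plan are hypothetical names of mine, and in a
real InputFormat you would wrap each range in an InputSplit (a
FileSplit, for instance) and let the record reader deal with record
boundaries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop: plans byte ranges for a fixed map count.
final class SplitPlanner {

    // One contiguous byte range of the input, standing in for an InputSplit.
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Divides totalBytes into desiredMaps contiguous ranges of near-equal size.
    // Returns fewer ranges only when there are fewer bytes than desired maps.
    static List<Range> plan(long totalBytes, int desiredMaps) {
        if (totalBytes < 0 || desiredMaps < 1) {
            throw new IllegalArgumentException("need totalBytes >= 0 and desiredMaps >= 1");
        }
        List<Range> ranges = new ArrayList<>();
        if (totalBytes == 0) {
            return ranges; // empty input, no maps
        }
        long count = Math.min((long) desiredMaps, totalBytes);
        long base = totalBytes / count;  // minimum size of every range
        long extra = totalBytes % count; // the first 'extra' ranges get one more byte
        long start = 0;
        for (long i = 0; i < count; i++) {
            long length = base + (i < extra ? 1 : 0);
            ranges.add(new Range(start, length));
            start += length;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: plan 18 maps over a 1000-byte input.
        for (Range r : plan(1000L, 18)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

The number of ranges returned here plays the role of the number of
splits, and therefore of the number of maps the job would run.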

You are further limited by the fixed task slot behavior in 0.20.x/1.x
releases, which use the MR1 framework (i.e. a JobTracker and
TaskTrackers). The property "mapred.tasktracker.map.tasks.maximum"
applies to a TaskTracker as a whole and not to an individual job, as
its name suggests, so it isn't what you'd configure to achieve what
you want.

In addition to this, YARN has a slotless NodeManager, wherein your job
can ask for a certain amount of resources on a per-task level and have
that request honored cluster-wide. Meaning, if your NodeManager is
configured to offer up to 8 GB, and your job/app requests 8 GB per
task/container, then at most 1 such container can run at a time on
any NodeManager that serves 8 GB of memory resources. Likewise, if
your demand becomes 8/18 GB (roughly 455 MB) per container/task, then
up to 18 containers can run in parallel on a given NM.

This is still not rigid though (fewer than 18 may run at the same time
on an NM as well, depending on the scheduler's distribution of
containers across all nodes), as that isn't MapReduce's goal in the
first place. If you want more rigidity, consider writing your own YARN
application that implements such a distribution goal.
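
The per-node arithmetic for the YARN case above can be sketched as
follows. This is only the memory division, under the assumption that
memory is the sole resource being counted; ContainerMath and its
methods are hypothetical names, not YARN APIs. In practice the inputs
would come from yarn.nodemanager.resource.memory-mb on the node side
and mapreduce.map.memory.mb on the job side, and the scheduler may
round a request up to its allocation increment, which can lower the
resulting count.

```java
// Hypothetical helper, not part of YARN: memory-only container arithmetic.
final class ContainerMath {

    // Upper bound on containers of the given size that fit on one node at once.
    static int maxContainersPerNode(long nodeMemoryMb, long containerMemoryMb) {
        if (nodeMemoryMb < 0 || containerMemoryMb <= 0) {
            throw new IllegalArgumentException("need nodeMemoryMb >= 0 and containerMemoryMb > 0");
        }
        return (int) (nodeMemoryMb / containerMemoryMb);
    }

    // Largest whole-MB request that still lets the wanted number of containers fit.
    static long containerMemoryFor(long nodeMemoryMb, int containersPerNode) {
        if (nodeMemoryMb < 0 || containersPerNode < 1) {
            throw new IllegalArgumentException("need nodeMemoryMb >= 0 and containersPerNode >= 1");
        }
        return nodeMemoryMb / containersPerNode;
    }

    public static void main(String[] args) {
        long nodeMb = 8L * 1024; // a NodeManager offering 8 GB

        // Job B: one 8 GB container per node.
        System.out.println(maxContainersPerNode(nodeMb, nodeMb));

        // Job A: size the request so that 18 containers fit per node.
        long perTaskMb = containerMemoryFor(nodeMb, 18);
        System.out.println(perTaskMb + " MB each, at most "
                + maxContainersPerNode(nodeMb, perTaskMb) + " per node");
    }
}
```

As noted above, this gives only an upper bound per node; the scheduler
is free to place fewer containers on any given NM.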
On Sat, Mar 23, 2013 at 3:18 AM, jeremy p
<[EMAIL PROTECTED]> wrote:
> I have two jobs, Job A and Job B. Job A needs to run with 18 mappers per
> machine, Job B needs to run with 1 mapper per machine. Hadoop doesn't give
> you a way to specify number of mappers on a per-job basis.
> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks do absolutely
> nothing. I've been looking into the Capacity Scheduler, but I'm unsure if
> it can help me. In this documentation, all the settings under "Resource
> Allocation" are cluster-wide. I need to be able to set the maximum capacity
> on a given machine. It does look like you have the option to set the
> required amount of memory per slot, but that setting applies to all the
> queues. If I could set that value on a per-queue basis, that would be
> Will the capacity scheduler help me here? Or am I barking up the wrong
> tree? If the capacity scheduler won't help me, can you think of anything
> that will?