Re: What happens when you have fewer input files than mapper slots?
Rahul Jain 2013-03-19, 21:08
Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
For MRv2 (YARN), you can pretty much achieve this using yarn.nodemanager.resource.memory-mb (a system-wide setting) and mapreduce.map.memory.mb (a job-level setting). For example, if yarn.nodemanager.resource.memory-mb=100 and mapreduce.map.memory.mb=40, a maximum of two mappers can run on a node at any time.

For MRv1, the equivalent is to control the mapper slots on each machine with mapred.tasktracker.map.tasks.maximum. Of course, this does not give you per-job control over mappers.

In both cases, you can additionally use a scheduler with pools/queues capability to restrict the overall use of grid resources. Do read the Fair Scheduler and Capacity Scheduler documentation.

-Rahul

On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <[EMAIL PROTECTED]> wrote:

> Short version : let's say you have 20 nodes, and each node has 10 mapper
> slots. You start a job with 20 very small input files. How is the work
> distributed to the cluster? Will it be even, with each node spawning one
> mapper task? Is there any way of predicting or controlling how the work
> will be distributed?
>
> Long version : My cluster is currently used for two different jobs. The
> cluster is currently optimized for Job A, so each node has a maximum of 18
> mapper slots. However, I also need to run Job B. Job B is VERY
> cpu-intensive, so we really only want one mapper to run on a node at any
> given time. I've done a bunch of research, and it doesn't seem like Hadoop
> gives you any way to set the maximum number of mappers per node on a
> per-job basis. I'm at my wit's end here, and considering some rather
> egregious workarounds. If you can think of anything that can help me, I'd
> very much appreciate it.
>
> Thanks!
>
> --Jeremy
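
As a rough illustration of the memory arithmetic in the MRv2 part of the reply, the sketch below computes how many map containers fit on one node, and the smallest per-map memory request that leaves room for only one mapper per node (the Job B case). The function names are hypothetical helpers, not Hadoop APIs, and the model is deliberately simplified: it assumes memory is the only limiting resource and ignores scheduler rounding of container sizes and the memory taken by the ApplicationMaster container.

```python
def max_concurrent_mappers(node_memory_mb: int, map_memory_mb: int) -> int:
    """Upper bound on map containers per node in a memory-only model.

    node_memory_mb stands in for yarn.nodemanager.resource.memory-mb,
    map_memory_mb for mapreduce.map.memory.mb. Hypothetical helper.
    """
    if node_memory_mb <= 0 or map_memory_mb <= 0:
        raise ValueError("memory values must be positive")
    # Only whole containers can be allocated, so use integer division.
    return node_memory_mb // map_memory_mb


def min_map_memory_for_single_mapper(node_memory_mb: int) -> int:
    """Smallest per-map request such that two maps cannot share a node.

    Two containers of size m fit only if 2 * m <= node memory, so any
    m strictly greater than half the node memory allows at most one.
    """
    if node_memory_mb <= 0:
        raise ValueError("memory value must be positive")
    return node_memory_mb // 2 + 1


if __name__ == "__main__":
    # The example from the reply: 100 MB per node, 40 MB per map task.
    print(max_concurrent_mappers(100, 40))        # 2
    # Per-map request that would limit the same node to one mapper.
    print(min_map_memory_for_single_mapper(100))  # 51
```

Under these assumptions, a job that must run one mapper per node would request slightly more than half of each node's advertised memory, while other jobs keep their smaller requests.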