Re: What happens when you have fewer input files than mapper slots?
Harsh J 2013-03-20, 00:04
You can leverage YARN's CPU Core scheduling feature for this purpose.
It was added to the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your need
exactly. However, looking at that patch, it seems like param-config
support for MR apps wasn't added by this so it may require some work
before you can easily leverage it in MRv2.

On MRv1, you can achieve the per-node memory supply vs. requirement
hack Rahul suggested by using the CapacityScheduler instead. It does
not have CPU Core based scheduling directly though.

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p <[EMAIL PROTECTED]> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores. The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine. This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <[EMAIL PROTECTED]> wrote:
>>
>> This may not be what you were looking for, but I was just curious when you
>> mentioned that you would only want to run only one map task because it was
>> cpu intensive. Well, the map tasks are supposed to be cpu intensive, isn't
>> it. If the maximum map slots are 10 then that would mean you have close to
>> 10 cores available in each node. So, if you run only one map task, no matter
>> how much cpu intensive it is, it will only be able to max out one core, so
>> the rest of the 9 cores would go under utilized. So, you can still run 9
>> more map tasks on that machine.
>>
>> Or, maybe your node's core count is way less than 10, in which case you
>> might be better off setting the mapper slots to a lower value anyway.
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1. I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need. I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node". I have not found any way to do this.
>>>
>>> I will definitely look into schedulers. Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>>>>
>>>> For MRv2 (yarn): you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>>> and
>>>> mapreduce.map.memory.mb (job level setting)
>>>>
>>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb= 40
>>>> a maximum of two mapper can run on a node at any time.
>>>>
>>>> For MRv1, The equivalent way will be to control mapper slots on each
>>>> machine:
>>>> mapred.tasktracker.map.tasks.maximum, of course this does not give you
>>>> 'per job' control. on mappers.
>>>>
>>>> In addition in both cases, you can use a scheduler with 'pools / queues'
>>>> capability in addition to restrict the overall use of grid resource. Do read
>>>> fair scheduler and capacity scheduler documentation...
>>>>
>>>> -Rahul
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots. You start a job with 20 very small input files. How is the
>>>>> work distributed to the cluster? Will it be even, with each node spawning
>>>>> one mapper task? Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of

Harsh J
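
The memory arithmetic in Rahul's MRv2 suggestion quoted above can be sketched roughly as follows. This is only an illustration, not Hadoop code: the function names and the 16384 MB node size are hypothetical, and it ignores both the scheduler's rounding of requests (yarn.scheduler.minimum-allocation-mb) and the memory taken by the ApplicationMaster container, so a real cluster may fit fewer mappers than this predicts.

```python
def max_mappers_per_node(node_memory_mb: int, map_memory_mb: int) -> int:
    """Upper bound on concurrent map containers per node, by memory alone.

    node_memory_mb stands in for yarn.nodemanager.resource.memory-mb and
    map_memory_mb for mapreduce.map.memory.mb.
    """
    if node_memory_mb <= 0 or map_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    return node_memory_mb // map_memory_mb


def map_memory_for(node_memory_mb: int, mappers_per_node: int) -> int:
    """Largest per-map request (MB) at which `mappers_per_node` maps still fit."""
    if not 0 < mappers_per_node <= node_memory_mb:
        raise ValueError("mappers_per_node must be between 1 and node_memory_mb")
    return node_memory_mb // mappers_per_node


if __name__ == "__main__":
    # Rahul's example: 100 MB per node, 40 MB per map task.
    print(max_mappers_per_node(100, 40))  # 2

    # Hypothetical 16384 MB node (an assumption, not from the thread).
    node_mb = 16384
    print(map_memory_for(node_mb, 1))   # Job A: one mapper per node
    print(map_memory_for(node_mb, 16))  # Job B: 16 mappers per node
```

Under these assumptions, each job would pass its own value as mapreduce.map.memory.mb at submission time, which is what gives the per-job control asked about in the thread.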