Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HDFS, mail # user - Re: What happens when you have fewer input files than mapper slots?


Copy link to this message
-
Re: What happens when you have fewer input files than mapper slots?
Harsh J 2013-03-20, 00:04
You can leverage YARN's CPU Core scheduling feature for this purpose.
It was added to the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it seems like
param-config support for MR apps wasn't added by this so it may
require some work before you can easily leverage it in MRv2.

On MRv1, you can achieve the per-node memory supply vs. requirement
hack Rahul suggested by using the CapacityScheduler instead. It does
not have CPU Core based scheduling directly though.

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
<[EMAIL PROTECTED]> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores.  The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine.  This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <[EMAIL PROTECTED]> wrote:
>>
>> This may not be what you were looking for, but I was just curious when you
>> mentioned that
>>  you would only want to run only one map task because it was cpu
>> intensive. Well, the map
>> tasks are supposed to be cpu intensive, isn't it. If the maximum map slots
>> are 10 then that
>> would mean you have close to 10 cores available in each node. So, if you
>> run only one
>> map task, no matter how much cpu intensive it is, it will only be able to
>> max out one core, so the
>> rest of the  9 cores would go under utilized. So, you can still run 9 more
>> map tasks on that machine.
>>
>> Or, maybe your node's core count is way less than 10, in which case you
>> might be better off setting
>> the mapper slots to a lower value anyway.
>>
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1.  I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need.  I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node".  I have not found any way to do this.
>>>
>>> I will definitely look into schedulers.  Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Which version of hadoop are you using ? MRV1 or MRV2 (yarn) ??
>>>>
>>>> For MRv2 (yarn): you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>>> and
>>>> mapreduce.map.memory.mb  (job level setting)
>>>>
>>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb= 40
>>>> a maximum of two mapper can run on a node at any time.
>>>>
>>>> For MRv1, The equivalent way will be to control mapper slots on each
>>>> machine:
>>>> mapred.tasktracker.map.tasks.maximum,  of course this does not give you
>>>> 'per job' control. on mappers.
>>>>
>>>> In addition in both cases, you can use a scheduler with 'pools / queues'
>>>> capability in addition to restrict the overall use of grid resource. Do read
>>>> fair scheduler and capacity scheduler documentation...
>>>>
>>>>
>>>> -Rahul
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of

Harsh J