

Re: What happens when you have fewer input files than mapper slots?
You can leverage YARN's CPU core scheduling feature for this purpose.
It was added in the 2.0.3 release via
https://issues.apache.org/jira/browse/YARN-2 and seems to fit your
need exactly. However, looking at that patch, it does not appear to add
configuration parameters for MR apps, so it may require some work
before you can easily use it in MRv2.

On MRv1, you can achieve the per-node memory supply vs. requirement
hack Rahul suggested by using the CapacityScheduler instead. It does
not offer CPU-core-based scheduling directly, though.
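
To make the memory supply vs. requirement arithmetic concrete, here is a
minimal Python sketch. The function names are made up for illustration and
are not Hadoop APIs. The first function ignores any rounding the scheduler
applies to requests. The second assumes a slot-based scheduler charges a
task one slot per slot-sized chunk of memory it asks for, which you should
verify against the CapacityScheduler documentation for your version.

```python
import math


def max_concurrent_mappers(node_memory_mb: int, map_memory_mb: int) -> int:
    """How many map tasks fit on one node at a time.

    node_memory_mb plays the role of yarn.nodemanager.resource.memory-mb
    (cluster-wide) and map_memory_mb the role of mapreduce.map.memory.mb
    (per job). Simplification: scheduler rounding of requests is ignored.
    """
    if node_memory_mb < 0 or map_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    return node_memory_mb // map_memory_mb


def slots_per_task(job_map_memory_mb: int, slot_memory_mb: int) -> int:
    """Slots one task occupies under the ASSUMED slot-based accounting:
    one slot per slot-sized chunk of requested memory, rounded up."""
    if job_map_memory_mb <= 0 or slot_memory_mb <= 0:
        raise ValueError("memory sizes must be positive")
    return math.ceil(job_map_memory_mb / slot_memory_mb)


if __name__ == "__main__":
    # Rahul's example: 100 MB per node, 40 MB per mapper.
    print(max_concurrent_mappers(100, 40))
    # Hypothetical "Job A": ask for the whole node so only one mapper fits.
    print(max_concurrent_mappers(16384, 16384))
    # Hypothetical "Job B": 16 mappers per node on the same hardware.
    print(max_concurrent_mappers(16384, 1024))
    # Hypothetical slot-based case: 1024 MB slots, a 10240 MB request.
    print(slots_per_task(10240, 1024))
```

The point of the design is that the per-node limit stays fixed while each
job picks its own per-task request, so the number of mappers per node
becomes a per-job decision.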

On Wed, Mar 20, 2013 at 4:08 AM, jeremy p
<[EMAIL PROTECTED]> wrote:
> The job we need to run executes some third-party code that utilizes multiple
> cores.  The only way the job will get done in a timely fashion is if we give
> it all the cores available on the machine.  This is not a task that can be
> split up.
>
> Yes, I know, it's not ideal, but this is the situation I have to deal with.
>
>
> On Tue, Mar 19, 2013 at 3:15 PM, hari <[EMAIL PROTECTED]> wrote:
>>
>> This may not be what you were looking for, but I was curious when you
>> mentioned that you would want to run only one map task because it was
>> CPU intensive. Map tasks are supposed to be CPU intensive, aren't they?
>> If the maximum map slots are 10, that would mean you have close to 10
>> cores available on each node. So if you run only one map task, no matter
>> how CPU intensive it is, it will only be able to max out one core, and
>> the remaining 9 cores would be underutilized. So you can still run 9
>> more map tasks on that machine.
>>
>> Or maybe your node's core count is far lower than 10, in which case you
>> might be better off setting the mapper slots to a lower value anyway.
>>
>>
>> On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <[EMAIL PROTECTED]>
>> wrote:
>>>
>>> Thank you for your help.
>>>
>>> We're using MRv1.  I've tried setting
>>> mapred.tasktracker.map.tasks.maximum and mapred.map.tasks, and neither one
>>> helped me at all.
>>>
>>> Per-job control is definitely what I need.  I need to be able to say,
>>> "For Job A, only use one mapper per node, but for Job B, use 16 mappers per
>>> node".  I have not found any way to do this.
>>>
>>> I will definitely look into schedulers.  Are there any examples you can
>>> point me to where someone does what I'm needing to do?
>>>
>>> --Jeremy
>>>
>>>
>>> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <[EMAIL PROTECTED]> wrote:
>>>>
>>>> Which version of Hadoop are you using? MRv1 or MRv2 (YARN)?
>>>>
>>>> For MRv2 (YARN), you can pretty much achieve this using:
>>>>
>>>> yarn.nodemanager.resource.memory-mb (system wide setting)
>>>> and
>>>> mapreduce.map.memory.mb  (job level setting)
>>>>
>>>> e.g. if yarn.nodemanager.resource.memory-mb=100
>>>> and mapreduce.map.memory.mb=40,
>>>> a maximum of two mappers can run on a node at any time.
>>>>
>>>> For MRv1, the equivalent is to control the mapper slots on each
>>>> machine with mapred.tasktracker.map.tasks.maximum. Of course, this does
>>>> not give you 'per job' control over mappers.
>>>>
>>>> In both cases, you can additionally use a scheduler with 'pools / queues'
>>>> capability to restrict the overall use of grid resources. Do read the
>>>> Fair Scheduler and Capacity Scheduler documentation...
>>>>
>>>>
>>>> -Rahul
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p
>>>> <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>> Short version : let's say you have 20 nodes, and each node has 10
>>>>> mapper slots.  You start a job with 20 very small input files.  How is the
>>>>> work distributed to the cluster?  Will it be even, with each node spawning
>>>>> one mapper task?  Is there any way of predicting or controlling how the work
>>>>> will be distributed?
>>>>>
>>>>> Long version : My cluster is currently used for two different jobs.
>>>>> The cluster is currently optimized for Job A, so each node has a maximum of

Harsh J