MapReduce, mail # user - Re: What happens when you have fewer input files than mapper slots?


Re: What happens when you have fewer input files than mapper slots?
hari 2013-03-19, 22:15
This may not be what you were looking for, but I was curious about your
comment that you want to run only one map task per node because the job is
CPU-intensive. Map tasks are generally expected to be CPU-intensive, aren't
they? If the maximum number of map slots is 10, that suggests each node has
close to 10 cores available. A single map task, no matter how CPU-intensive,
can max out only one core (assuming it is single-threaded), so the other
9 cores would be underutilized. You could still run 9 more map tasks on
that machine.

Or maybe your nodes have far fewer than 10 cores, in which case you might
be better off setting the mapper slots to a lower value anyway.
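
To put rough numbers on the point above, here is a small sketch of the
arithmetic. It assumes each map task is single-threaded and can keep at most
one core busy; the function names and the 10-core figure are illustrative
only and are not part of any Hadoop API.

```python
def core_utilization(running_map_tasks, cores_per_node):
    """Fraction of a node's cores kept busy by map tasks.

    Assumption: each map task is single-threaded, so it can
    saturate at most one core.
    """
    if cores_per_node <= 0:
        raise ValueError("cores_per_node must be positive")
    busy_cores = min(running_map_tasks, cores_per_node)
    return busy_cores / cores_per_node


def idle_cores(running_map_tasks, cores_per_node):
    """Number of cores left with no map task to run."""
    return max(cores_per_node - running_map_tasks, 0)


if __name__ == "__main__":
    # One CPU-bound mapper on a 10-core node: 10% busy, 9 cores idle.
    print(core_utilization(1, 10), idle_cores(1, 10))
```

If Job B's mappers are multi-threaded internally, the assumption does not
hold and one mapper per node may well be the right choice.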

On Tue, Mar 19, 2013 at 5:18 PM, jeremy p <[EMAIL PROTECTED]> wrote:

> Thank you for your help.
>
> We're using MRv1.  I've tried setting mapred.tasktracker.map.tasks.maximum
> and mapred.map.tasks, and neither one helped me at all.
>
> Per-job control is definitely what I need.  I need to be able to say, "For
> Job A, only use one mapper per node, but for Job B, use 16 mappers per
> node".  I have not found any way to do this.
>
> I will definitely look into schedulers.  Are there any examples you can
> point me to where someone does what I'm needing to do?
>
> --Jeremy
>
>
> On Tue, Mar 19, 2013 at 2:08 PM, Rahul Jain <[EMAIL PROTECTED]> wrote:
>
>> Which version of Hadoop are you using: MRv1 or MRv2 (YARN)?
>>
>> For MRv2 (yarn): you can pretty much achieve this using:
>>
>> yarn.nodemanager.resource.memory-mb (system wide setting)
>> and
>> mapreduce.map.memory.mb  (job level setting)
>>
>> e.g. if yarn.nodemanager.resource.memory-mb=100
>> and mapreduce.map.memory.mb=40,
>> a maximum of two mappers can run on a node at any time.
>>
>> For MRv1, the equivalent is to control the mapper slots on each
>> machine with mapred.tasktracker.map.tasks.maximum. Of course, this does
>> not give you 'per job' control over mappers.
>>
>> In both cases, you can additionally use a scheduler with 'pools / queues'
>> capability to restrict the overall use of grid resources. Do read the
>> Fair Scheduler and Capacity Scheduler documentation.
>>
>>
>> -Rahul
>>
>>
>>
>>
>> On Tue, Mar 19, 2013 at 1:55 PM, jeremy p <[EMAIL PROTECTED]> wrote:
>>
>>> Short version: let's say you have 20 nodes, and each node has 10 mapper
>>> slots.  You start a job with 20 very small input files.  How is the work
>>> distributed to the cluster?  Will it be even, with each node spawning one
>>> mapper task?  Is there any way of predicting or controlling how the work
>>> will be distributed?
>>>
>>> Long version: My cluster is currently used for two different jobs.  The
>>> cluster is currently optimized for Job A, so each node has a maximum of 18
>>> mapper slots.  However, I also need to run Job B.  Job B is VERY
>>> cpu-intensive, so we really only want one mapper to run on a node at any
>>> given time.  I've done a bunch of research, and it doesn't seem like Hadoop
>>> gives you any way to set the maximum number of mappers per node on a
>>> per-job basis.  I'm at my wit's end here, and considering some rather
>>> egregious workarounds.  If you can think of anything that can help me, I'd
>>> very much appreciate it.
>>>
>>> Thanks!
>>>
>>> --Jeremy
>>>
>>
>>
>
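
On the MRv2 settings Rahul mentions in the quoted message: the per-node cap
he describes follows from integer division of the node's memory by the
per-map memory. A minimal sketch, assuming memory is the only limit being
enforced and ignoring reducers, the application master, and any rounding of
container sizes the scheduler may apply. The function name is hypothetical.

```python
def max_concurrent_maps(node_memory_mb, map_memory_mb):
    """Upper bound on map containers that fit on one node at once.

    node_memory_mb stands in for yarn.nodemanager.resource.memory-mb,
    map_memory_mb for mapreduce.map.memory.mb (set per job).
    Assumption: memory is the only constraint considered.
    """
    if node_memory_mb < 0 or map_memory_mb <= 0:
        raise ValueError("memory values must be positive")
    return node_memory_mb // map_memory_mb


if __name__ == "__main__":
    # The example from the quoted message: 100 MB per node, 40 MB per map.
    print(max_concurrent_maps(100, 40))  # prints 2
```

By the same arithmetic, a job that requests more than half of a node's
memory per map task would fit at most one mapper on that node, which is the
kind of per-job control asked about, though only on MRv2.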