HDFS, mail # user - Re: knowing the nodes on which reduce tasks will run


Steve Loughran 2012-09-04, 11:33
Michael Segel 2012-09-03, 15:59
Re: knowing the nodes on which reduce tasks will run
Hemanth Yamijala 2012-09-03, 15:56
Hi,

You are right that a change to mapred.tasktracker.reduce.tasks.maximum will
require a restart of the tasktrackers. AFAIK, there is no way of modifying
this property without restarting.

On a different note, could you see if the amount of intermediate data can
be reduced using a combiner, or some other form of local aggregation?
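
To illustrate what local aggregation buys, here is a minimal sketch in plain Java. It does not use the Hadoop API; the class name `LocalAggregation` and the sample keys are made up for illustration. It simulates what a combiner would do for a counting job: collapse repeated keys emitted by one map task into one partial sum per key before anything is written to the intermediate directory. In a real job the same logic would normally live in a reducer-style class registered as the job's combiner, and it is only safe when the reduce operation is associative and commutative (sums, counts, min, max).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class LocalAggregation {

    // Hypothetical helper: treats each emitted key as a (key, 1) record and
    // sums the counts per key, as a combiner would for a counting job.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new LinkedHashMap<>();
        for (String key : mapOutputKeys) {
            // Add 1 to the running count for this key, starting from 1.
            partial.merge(key, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        // Sample map output for one task (assumed data, not from the thread).
        List<String> emitted =
                Arrays.asList("error", "warn", "error", "error", "warn", "info");

        Map<String, Integer> combined = combine(emitted);

        // Records that would hit local disk without and with local aggregation.
        System.out.println("records before: " + emitted.size());
        System.out.println("records after:  " + combined.size());
        System.out.println(combined);
    }
}
```

How much this helps depends on how often keys repeat within a single map task; if nearly every key is unique, the intermediate data will barely shrink.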

Thanks
hemanth

On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi <
[EMAIL PROTECTED]> wrote:

> How can I set 'mapred.tasktracker.reduce.tasks.maximum' to "0" in a
> running tasktracker?
> It seems that I need to restart the tasktracker, and in that case I'll lose
> the map output produced by that tasktracker.
>
> Can I change 'mapred.tasktracker.reduce.tasks.maximum' to "0" without
> restarting the tasktracker?
>
> ~Abhay
>
>
> On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:
>
>> Hi Abhay
>>
>> The TaskTrackers on which the reduce tasks are launched are chosen at
>> random, based on reduce slot availability. So if you don't want reduce
>> tasks to be scheduled on some particular nodes, you need to set
>> 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The
>> catch here is that this property is not a job-level one; you need to
>> set it at the cluster level.
>>
>> A cleaner approach would be to configure each of your nodes with the right
>> number of map and reduce slots, based on the resources available on each
>> machine.
>>
>>
>> On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi <
>> [EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> How can one get to know the nodes on which reduce tasks will run?
>>>
>>> One of my jobs is running and it is completing all the map tasks.
>>> My map tasks write lots of intermediate data. The intermediate directory
>>> is getting full on all the nodes.
>>> If a reduce task is assigned to any of these nodes, it will try to copy
>>> the data to the same disk and will eventually fail with disk-space-related
>>> exceptions.
>>>
>>> I have added a few more tasktracker nodes to the cluster and now want to
>>> run the reducers on the new nodes only.
>>> Is it possible to choose the node on which a reducer will run? What
>>> algorithm does Hadoop use to pick a node for a reducer?
>>>
>>> Thanks in advance.
>>>
>>> Bye
>>> Abhay
>>>
>>
>>
>
Abhay Ratnaparkhi 2012-09-03, 16:00