-
knowing the nodes on which reduce tasks will run
Abhay Ratnaparkhi 2012-09-03, 14:19
Hello,
How can one get to know the nodes on which reduce tasks will run?
One of my jobs is running and is completing all of its map tasks. My map tasks write a lot of intermediate data, and the intermediate directory is getting full on all the nodes. If a reduce task is assigned to any of these nodes, it will try to copy the data to the same disk and will eventually fail with disk-space-related exceptions.
I have added a few more tasktracker nodes to the cluster and now want to run the reducers on the new nodes only. Is it possible to choose the node on which a reducer will run? What algorithm does Hadoop use to pick a node for a reducer?
Thanks in advance.
Bye Abhay
-
Re: knowing the nodes on which reduce tasks will run
Bertrand Dechoux 2012-09-03, 15:24
Hi,
A reducer runs wherever a reduce slot is available. The location is not related to where the data is located, and it is not possible to choose where the reducer will run (except by tweaking the tasktracker...).
Regards
Bertrand
-- Bertrand Dechoux
-
Re: knowing the nodes on which reduce tasks will run
Bejoy Ks 2012-09-03, 15:23
Hi Abhay
The TaskTrackers on which reduce tasks are launched are chosen at random, based on reduce slot availability. So if you don't want reduce tasks to be scheduled on particular nodes, you need to set 'mapred.tasktracker.reduce.tasks.maximum' to 0 on those nodes. The limitation here is that this property is not a job-level one; you need to set it at the cluster level.
A cleaner approach would be to configure each of your nodes with the right number of map and reduce slots, based on the resources available on each machine.
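The slot-based placement described above can be sketched with a toy model. This is not the JobTracker's actual scheduling code; the function `assign_reducers`, the node names, and the slot counts are all hypothetical, and the dictionary values only stand in for what 'mapred.tasktracker.reduce.tasks.maximum' would be on each node. It shows that a node with zero reduce slots is never picked.

```python
import random


def assign_reducers(reduce_slots, num_reducers, seed=None):
    """Place reducers on randomly chosen nodes that still have a free reduce slot.

    reduce_slots maps a node name to its maximum number of reduce slots
    (a stand-in for mapred.tasktracker.reduce.tasks.maximum on that node).
    """
    rng = random.Random(seed)
    free = dict(reduce_slots)  # remaining free slots per node
    placement = []
    for _ in range(num_reducers):
        candidates = [node for node, slots in free.items() if slots > 0]
        if not candidates:
            break  # no free slot anywhere: remaining reducers would have to wait
        node = rng.choice(candidates)
        free[node] -= 1
        placement.append(node)
    return placement


# Hypothetical cluster: the old nodes have full disks, so their reduce slots are 0.
cluster = {"old-node-1": 0, "old-node-2": 0, "new-node-1": 2, "new-node-2": 2}
placement = assign_reducers(cluster, num_reducers=3, seed=42)
print(placement)  # only new-node-* entries can appear
```

In this model, restricting reducers to the new nodes is purely a matter of per-node slot counts, which matches the point that the setting is per tasktracker rather than per job.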
-
Re: knowing the nodes on which reduce tasks will run
Narasingu Ramesh 2012-09-04, 08:11
Hi Abhay, the NameNode has the addresses of all the data nodes. MapReduce does all the data processing. The data set is first put into the HDFS filesystem, and then the hadoop jar file is run. The map tasks handle the input files, and their output is shuffled, sorted, and grouped together. Once the map tasks are complete, the reduce tasks, which have yet to start, run and sort the data again; meanwhile the job tracker and task trackers are running for each of the job's tasks.
Thanks & Regards, Ramesh.Narasingu
On Mon, Sep 3, 2012 at 9:30 PM, Abhay Ratnaparkhi < [EMAIL PROTECTED]> wrote:
> All of my map tasks are about to complete and there is not much processing
> to be done in the reducer.
> The job has been running for a week, so I don't want the job to fail. Any
> other suggestion to tackle this is welcome.
>
> ~Abhay
>
> On Mon, Sep 3, 2012 at 9:26 PM, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> You are right that a change to mapred.tasktracker.reduce.tasks.maximum
>> will require a restart of the tasktrackers. AFAIK, there is no way of
>> modifying this property without restarting.
>>
>> On a different note, could you see if the amount of intermediate data can
>> be reduced using a combiner, or some other form of local aggregation?
>>
>> Thanks
>> hemanth
>>
>> On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi <[EMAIL PROTECTED]> wrote:
>>
>>> How can I set 'mapred.tasktracker.reduce.tasks.maximum' to "0" in a
>>> running tasktracker?
>>> It seems that I need to restart the tasktracker, and in that case I'll
>>> lose the output of the map tasks run by that particular tasktracker.
>>>
>>> Can I change 'mapred.tasktracker.reduce.tasks.maximum' to "0"
>>> without restarting the tasktracker?
>>>
>>> ~Abhay
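Hemanth's suggestion of a combiner or other local aggregation can be illustrated with a small sketch. It is plain Python rather than Hadoop code, it assumes a word-count style job (the thread does not say what Abhay's job computes), and the function names are hypothetical. The point is only that aggregating on the map side shrinks the number of intermediate records written to local disk.

```python
from collections import Counter


def map_without_combiner(words):
    # Emit one (word, 1) record per input word, as a plain mapper would.
    return [(word, 1) for word in words]


def map_with_combiner(words):
    # Aggregate counts locally first, then emit one record per distinct word.
    return sorted(Counter(words).items())


words = ["a", "b", "a", "c", "a", "b"]
raw = map_without_combiner(words)
combined = map_with_combiner(words)
print(len(raw), len(combined))  # prints "6 3"
```

Local aggregation of this kind is only safe when the reduce operation is associative and commutative (sums, counts, minima, maxima), so whether it applies depends on the job.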