Help: How to increase the number of map tasks per job?
Tali K 2011-01-07, 20:58
We have a job that runs in several map/reduce stages. In the first job, a large number of map tasks (82) are initiated, as expected, and that causes all nodes to be used. In a later job, where we are still dealing with large amounts of data, only 4 map tasks are initiated, so only 4 nodes are used. This stage is actually the workhorse of the job and requires much more processing power than the initial stage. We are trying to understand why so few map tasks are started, as we are not getting the full advantage of our cluster.
Re: Help: How to increase the number of map tasks per job?
Ted Yu 2011-01-07, 21:19
Set higher values for mapred.tasktracker.map.tasks.maximum (and mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml
Re: Help: How to increase the number of map tasks per job?
Rahul Jain 2011-01-07, 21:37
Also make sure you have enough input files for the next stage's mappers to work with. Read through the input splits part of the tutorial: http://wiki.apache.org/hadoop/HadoopMapReduce
If the last stage had only 4 reducers running, they would generate 4 output files. This will limit the number of mappers started in the next stage to 4, unless you tune your input split parameters or write a custom input split.
Hope this helps; there is a lot more literature on this on the web and in the Hadoop books released to date.
-Rahul
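To make the point about output files concrete, here is a minimal Java sketch of how a file-based input format ends up with one map task per small file. It is an illustration under stated assumptions, not Hadoop source code: the class name `SplitCounter`, the 40 MB file sizes, and the split sizes are hypothetical, and the 1.1 slop factor is modeled on the behavior usually described for Hadoop's FileInputFormat.

```java
// Sketch only: models how a file-based InputFormat counts splits.
// Names and sizes are illustrative assumptions, not Hadoop source code.
final class SplitCounter {
    // Assumption: the last split of a file may run up to 10% over the split size.
    private static final double SPLIT_SLOP = 1.1;

    /** Number of splits for one splittable file of the given length. */
    static int splitsForFile(long fileBytes, long splitBytes) {
        if (fileBytes <= 0) {
            return 0; // in this simplified model, empty files produce no splits
        }
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        return splits + 1; // the remaining tail becomes one final split
    }

    /** Total splits (one map task each) over all files; files are never merged. */
    static int totalSplits(long[] fileSizes, long splitBytes) {
        int total = 0;
        for (long size : fileSizes) {
            total += splitsForFile(size, splitBytes);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // Hypothetical: 4 reducers each wrote one 40 MB part file.
        long[] partFiles = {40 * mb, 40 * mb, 40 * mb, 40 * mb};
        System.out.println(totalSplits(partFiles, 64 * mb)); // 4: one split per file
        System.out.println(totalSplits(partFiles, 16 * mb)); // 12: smaller splits, more map tasks
    }
}
```

In this model, the only ways to get more than four map tasks out of four small files are to shrink the effective split size or to have the previous stage write more files, for example by running it with more reducers.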
RE: Help: How to increase the number of map tasks per job?
Tali K 2011-01-07, 21:40
According to the documentation, that parameter is for the number of tasks *per TaskTracker*. I am asking about the number of tasks for the entire job and the entire cluster. That parameter is already set to 3, which is one less than the number of cores on each node's CPU, as recommended. In my question I stated that 82 tasks were run for the first job, yet only 4 for the second, both numbers being cluster-wide.
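The difference between per-TaskTracker slots and the job-wide task count can be shown with some simple arithmetic. The Java sketch below is a back-of-the-envelope model, not a Hadoop API: the class `SlotMath` and the cluster size of 20 nodes are hypothetical (the thread does not state the node count), while the value of 3 map slots per TaskTracker comes from the message above.

```java
// Sketch only: a back-of-the-envelope model, not a Hadoop API.
final class SlotMath {
    /** Map slots available cluster-wide, assuming one TaskTracker per node. */
    static int clusterMapSlots(int nodes, int mapSlotsPerTaskTracker) {
        return nodes * mapSlotsPerTaskTracker;
    }

    /** Map tasks of a single job that can run at once on an otherwise idle cluster. */
    static int concurrentMapTasks(int jobMapTasks, int nodes, int mapSlotsPerTaskTracker) {
        return Math.min(jobMapTasks, clusterMapSlots(nodes, mapSlotsPerTaskTracker));
    }

    public static void main(String[] args) {
        int nodes = 20; // hypothetical cluster size, not stated in the thread
        int slots = 3;  // mapred.tasktracker.map.tasks.maximum, as reported in the thread
        System.out.println(concurrentMapTasks(82, nodes, slots)); // 60: limited by slots
        System.out.println(concurrentMapTasks(4, nodes, slots));  // 4: limited by the job itself
        System.out.println(concurrentMapTasks(4, nodes, 8));      // 4: more slots change nothing
    }
}
```

Under these assumptions, raising the slot maximum only helps a job that already has more map tasks than slots. A job with 4 map tasks stays at 4 however many slots each node offers.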
Re: Help: How to increase the number of map tasks per job?
Niels Basjes 2011-01-07, 21:44
You said you have a large amount of data. Approximately how large is it? Did you compress the intermediate data, and if so, with what codec?
Niels
-- Kind regards,
Niels Basjes
Re: Help: How to increase the number of map tasks per job?
Ted Yu 2011-01-07, 21:47
Check out mapred.map.tasks and mapred.reduce.tasks
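For file-based inputs, mapred.map.tasks is generally described as a hint to the input format, not a hard setting. The Java sketch below reproduces the split-size arithmetic commonly attributed to the old org.apache.hadoop.mapred FileInputFormat. Treat the formula as an assumption to verify against your Hadoop release; the class name `SplitSizeHint` and all sizes are hypothetical.

```java
// Sketch only: assumed split-size arithmetic, not copied from Hadoop source.
final class SplitSizeHint {
    /**
     * Assumed formula: goalSize  = totalBytes / requestedMaps
     *                  splitSize = max(minSplitBytes, min(goalSize, blockBytes))
     * requestedMaps stands in for the mapred.map.tasks hint.
     */
    static long splitSize(long totalBytes, int requestedMaps, long minSplitBytes, long blockBytes) {
        long goalSize = totalBytes / Math.max(1, requestedMaps);
        return Math.max(minSplitBytes, Math.min(goalSize, blockBytes));
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long total = 160 * mb; // hypothetical total input size
        long block = 64 * mb;  // hypothetical HDFS block size
        System.out.println(splitSize(total, 1, 1, block) / mb);         // 64: capped by the block size
        System.out.println(splitSize(total, 40, 1, block) / mb);        // 4: a larger hint shrinks splits
        System.out.println(splitSize(total, 40, 128 * mb, block) / mb); // 128: the minimum split size wins
    }
}
```

If this model holds, a larger hint can only increase the number of map tasks when the input files are splittable. It has no effect on reduce tasks, whose number mapred.reduce.tasks sets directly.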
Re: Help: How to increase the number of map tasks per job?
Harsh J 2011-01-08, 04:12
It would depend on your input format. If the job is using an InputFormat that does not let it split files, you would get exactly one mapper per file (mappers == no. of files). For splittable input files, you can get mappers >= no. of files. A little more information on what the input format is would help track down the problem.
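A rough Java sketch of the splittable versus non-splittable distinction follows. It assumes, as a simplification, that a text input format refuses to split any file with a compression extension; the class `SplittableCheck`, the suffix list, and the file names and sizes are hypothetical, and split slop is ignored.

```java
// Sketch only: a simplified model of the "is this file splittable?" decision.
final class SplittableCheck {
    // Assumed, simplified list of extensions treated as non-splittable compressed files.
    private static final String[] COMPRESSED_SUFFIXES = {".gz", ".deflate"};

    /** True when the file has none of the compression suffixes above. */
    static boolean isSplittable(String fileName) {
        for (String suffix : COMPRESSED_SUFFIXES) {
            if (fileName.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    /** Map tasks for one file: one if it cannot be split, else ceil(size / splitSize). */
    static long mapTasks(String fileName, long fileBytes, long splitBytes) {
        if (fileBytes <= 0) {
            return 0; // empty files ignored in this model
        }
        if (!isSplittable(fileName)) {
            return 1; // the whole file goes to a single mapper
        }
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long gb = 1024L * mb;
        System.out.println(mapTasks("part-00000.gz", gb, 64 * mb)); // 1
        System.out.println(mapTasks("part-00000", gb, 64 * mb));    // 16
    }
}
```

In this model, four large gzip-compressed part files would still yield only four mappers, which would match the behavior described in the original question.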
-- Harsh J www.harshj.com