Hadoop >> mail # user >> Understanding of the hadoop distribution system (tuning)

Re: Understanding of the hadoop distribution system (tuning)
Hi Elaine

Slots (mapred.tasktracker.[map/reduce].tasks.maximum) are configured at the
cluster/node/TaskTracker level, not at the job level. You configure them
based on the resources available on each node. Here you need to consider the
number of cores, not the number of CPUs: say you have 4 quad-core processors,
then you have 16 cores; if they are hyper-threaded you can take the effective
number of cores as 1 - 1.5 times the actual number. When you specify slots
you also need to consider memory: if a task JVM (mapred.child.java.opts) is
configured with 2 GB and you have just 16 GB of memory at your disposal, then
you can have only 16/2 = 8 slots. If you configure more slots, it can lead to
swapping and OOM issues when all slots are used in parallel.
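To make the sizing rule above concrete, here is a minimal sketch (plain Python, not any Hadoop API; the helper name and numbers are illustrative, taken from this thread) that derives a safe slot count from both the core count and the memory budget:

```python
# Hypothetical helper illustrating the sizing rule above: the slot count is
# bounded both by effective cores and by memory available per task JVM.
def max_slots(physical_cores, hyperthreaded, mem_gb, task_heap_gb):
    # With hyper-threading, count each core as roughly 1.5 "effective" cores.
    effective_cores = physical_cores * 1.5 if hyperthreaded else physical_cores
    # Memory bound: each task JVM (mapred.child.java.opts) needs its own heap.
    mem_bound = mem_gb // task_heap_gb
    # Take the stricter of the two limits to avoid swapping/OOM issues.
    return int(min(effective_cores, mem_bound))

# 4 quad-core CPUs (16 cores, no HT), 16 GB RAM, 2 GB per task JVM -> 8 slots
print(max_slots(16, False, 16, 2))
```

With these numbers the memory bound (8) is the limiting factor, which is why configuring 20 slots per TaskTracker risks swapping once all slots fill up.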

When map tasks run, you get a good proportion of data-local map tasks if you
have a good replication factor; the default of 3 is good.

Map tasks are scheduled on nodes by the JobTracker largely based on
data locality and available slots, so you cannot assume that the map tasks
will be uniformly distributed across the cluster. If you have 8 map slots
on a TaskTracker, and a MapReduce job has 8 map tasks whose data all resides
on that same node, then all 8 map tasks can end up running on that one node.
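That scheduling behaviour can be modelled with a toy sketch (an assumption-laden simplification, not the actual JobTracker code): for each map task, prefer a node that both holds a replica of its split and has a free slot, and only fall back to a non-local node otherwise.

```python
# Toy locality-aware scheduler: assign each map task to a node holding a
# replica of its split if that node has a free slot, else any node with slots.
def schedule(tasks, replicas, slots):
    """tasks: list of task ids; replicas: task id -> set of nodes holding its
    data; slots: node -> free slot count. Returns task -> (node, data_local)."""
    placement = {}
    for t in tasks:
        local_nodes = [n for n in replicas[t] if slots.get(n, 0) > 0]
        if local_nodes:
            node, data_local = local_nodes[0], True
        else:
            node = max(slots, key=slots.get)  # any node with free slots
            data_local = False
        slots[node] -= 1
        placement[t] = (node, data_local)
    return placement

# 8 map tasks whose data all sits on node "tt1", which has 8 free map slots:
# every task lands on tt1, all data-local -- matching the point above.
p = schedule(list(range(8)), {t: {"tt1"} for t in range(8)},
             {"tt1": 8, "tt2": 8})
print(all(node == "tt1" and local for node, local in p.values()))
```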

A few responses inline:

Case (2)
Launched map tasks 0 0 2
Data-local map tasks 0 0 1

Hmm.. I don't quite understand this. In case (2), does it mean two map tasks
are actually reading data from the same datanode?
[Bejoy] Of the 2 launched map tasks, 1 had its data on the same node where
the task executed, and 1 ran on one node while pulling its data from another
node.

But anyway, is this monitoring needed for tuning performance?
[Bejoy] If you see a small number of data-local map tasks, you need to look
into it seriously, as it can degrade performance to a great extent. With
larger data volumes, a few non-data-local map tasks are common.
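The counters above can be turned into a quick health check (a hypothetical helper; the counter values are the ones from the JobTracker UI excerpt in this thread):

```python
# Hypothetical check on the JobTracker counters "Launched map tasks" and
# "Data-local map tasks": a low data-local fraction hints at a locality problem.
def locality_ratio(launched, data_local):
    """Fraction of launched map tasks that were data-local."""
    return data_local / launched if launched else 1.0

# Case (2) from above: 2 launched, 1 data-local -> ratio 0.5
print(locality_ratio(2, 1))
```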

Bejoy KS

On Tue, Sep 11, 2012 at 11:37 AM, Elaine Gan <[EMAIL PROTECTED]> wrote:

> Hi Hermanth
> Thank you for your detailed answer. Your answers helped me a lot in
> understanding, especially on the Job UI.
> Sorry, I missed out my specs.
> NameNode (JobTracker) : CPUx4
> DataNode (TaskTracker) : CPUx4
> I am replying inline too.
> > > I have a data of around 518MB, and i wrote a MR program to process it.
> > > Here are some of my settings in my mapred-site.xml.
> > > ---------------------------------------------------------------
> > > mapred.tasktracker.map.tasks.maximum = 20
> > > mapred.tasktracker.reduce.tasks.maximum = 20
> > > ---------------------------------------------------------------
> > >
> >
> > These two configurations essentially tell the tasktrackers that they can
> > run 20 maps and 20 reduces in parallel on a machine. Is this what you
> > intended ? (Generally the sum of these two values should equal the number
> > of cores on your tasktracker node, or a little more).
> >
> > Also, would help if you can tell us your cluster size - i.e. number of
> > slaves.
> Cluster size (No of slaves) = 4
> Yes, I meant that the maximum number of tasks that can run on a machine
> is 20, for both map and reduce.
> > > My block size is default, 64MB
> > > With my data size = 518MB, i guess setting the maximum for MR task to
> 20
> > > is far more than enough (518/64 = 8) , did i get it correctly?
> > >
> > >
> > I suppose what you want is to run all the maps in parallel. For that, the
> > number of map slots in your cluster should be more than the number of maps
> > of your job (assuming there's a single job running). If the number of slots
> > is less than the number of maps, the maps would be scheduled in multiple
> > waves. On your jobtracker main page, the Cluster Summary > Map Task
> > Capacity gives you the total slots available in your cluster.
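The slots-vs-waves relationship described in the quoted reply can be sketched as plain arithmetic (the 1.1 "slop" factor below is my assumption about FileInputFormat's default split behaviour, which explains why 518 MB over 64 MB blocks yields 8 map tasks rather than 9):

```python
import math

def num_splits(file_mb, block_mb, slop=1.1):
    # Keep cutting block-sized splits while the remainder exceeds
    # slop * block size; the small tail is absorbed into the last split.
    splits, remaining = 0, file_mb
    while remaining / block_mb > slop:
        splits += 1
        remaining -= block_mb
    return splits + (1 if remaining > 0 else 0)

def num_waves(map_tasks, map_slots):
    # If the cluster has fewer map slots than map tasks,
    # the maps run in ceil(tasks / slots) waves.
    return math.ceil(map_tasks / map_slots)

# 518 MB input, 64 MB blocks -> 8 map tasks (the 6 MB tail joins the last split);
# with 80 map slots in the cluster, all 8 run in a single wave.
tasks = num_splits(518, 64)
print(tasks, num_waves(tasks, 80))
```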
> My Map Task Capacity = 80
> So, from the explanation and from my data size and configuration,
> Data size = 518MB
> Number of map tasks required =  518/64 = 8 tasks
> This 8 tasks should be spread among 4 slaves, which means each nodes