-
Re: Understanding of the hadoop distribution system (tuning)
Hemanth Yamijala 2012-09-11, 04:19
Hi,
Responses inline to some points.
On Tue, Sep 11, 2012 at 7:26 AM, Elaine Gan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm new to hadoop and i've just played around with map reduce.
> I would like to check if my understanding to hadoop is correct and i
> would appreciate if anyone could correct me if i'm wrong.
>
> I have a data of around 518MB, and i wrote a MR program to process it.
> Here are some of my settings in my mapred-site.xml.
> ---------------------------------------------------------------
> mapred.tasktracker.map.tasks.maximum = 20
> mapred.tasktracker.reduce.tasks.maximum = 20
> ---------------------------------------------------------------
>
These two configurations essentially tell the tasktrackers that they can run 20 maps and 20 reduces in parallel on a machine. Is this what you intended? (Generally the sum of these two values should equal the number of cores on your tasktracker node, or a little more.)
Also, it would help if you can tell us your cluster size - i.e. number of slaves.

> My block size is default, 64MB
> With my data size = 518MB, i guess setting the maximum for MR task to 20
> is far more than enough (518/64 = 8) , did i get it correctly?
>
>

I suppose what you want is to run all the maps in parallel. For that, the number of map slots in your cluster should be more than the number of maps of your job (assuming there's a single job running). If the number of slots is less than the number of maps, the maps would be scheduled in multiple waves. On your jobtracker main page, the Cluster Summary > Map Task Capacity gives you the total slots available in your cluster.
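The wave arithmetic described above can be sketched roughly as follows. This is only an illustration: the function name `map_waves` and the example numbers are assumptions, not anything Hadoop provides, and it assumes a single job with every slot free.

```python
import math


def map_waves(num_maps: int, num_slaves: int, map_slots_per_node: int) -> int:
    """Estimate how many scheduling waves a job's map tasks need.

    Assumes one job on the cluster and that every map slot is available.
    """
    if num_maps < 0 or num_slaves <= 0 or map_slots_per_node <= 0:
        raise ValueError("num_maps must be >= 0; slaves and slots must be > 0")
    # Total map slots = Map Task Capacity shown on the jobtracker page.
    total_slots = num_slaves * map_slots_per_node
    # Each wave fills at most total_slots maps; round up for the remainder.
    return math.ceil(num_maps / total_slots)


if __name__ == "__main__":
    # Hypothetical cluster: 4 slaves with 2 map slots each, job with 20 maps.
    print(map_waves(20, 4, 2))  # 3
```

If the result is 1, all maps of the job can run in parallel; anything higher means some maps wait for a slot to free up.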
> When i run the MR program, i could see in the Map/Reduce Administration
> page that the number of Maps Total = 8, so i assume that everything is
> going well here, once again if i'm wrong please correct me.
> (Sometimes it shows only Maps Total = 3)
>
>

This value tells us the number of maps that will run for the job.

> There's one thing which i'm uncertain about hadoop distribution.
> Is the Maps Total = 8 means that there are 8 map tasks split among all
> the data nodes (task trackers)?
> Is there anyway i can checked whether all the tasks are shared among
> datanodes (where task trackers are working).
>
There's no easy way to check this. The task page for every task shows the attempts that ran for each task and where they ran under the 'Machine' column.

> When i clicked on each link under that Task Id, i can see there's "Input
> Split Locations" stated under each task details, if the inputs are
> splitted between data nodes, does that means that everything is working
> well?
>
>

I think this is just the location of the splits, including the replicas. What you could check is whether enough data-local maps ran - which means that the tasks mostly got their inputs from datanodes running on the same machine as themselves. This is given by the counter "Data-local map tasks" on the job UI page.

> I need to make sure i got everything running well because my MR took
> around 6 hours to finish despite the input size is small.. (Well, i know
> hadoop is not meant for small data), I'm not sure whether it's my
> configuration that goes wrong or hadoop is just not suitable for my case.
> I'm actually running a mahout kmeans analysis.
>
> Thank you for your time.
-
Re: Understanding of the hadoop distribution system (tuning)
Elaine Gan 2012-09-11, 06:07
Hi Hemanth,
Thank you for your detailed answer. Your answers helped me a lot in understanding, especially on the Job UI.
Sorry, I left out my specs.
NameNode (JobTracker) : CPUx4
DataNode (TaskTracker) : CPUx4
I am replying inline too.
> > I have a data of around 518MB, and i wrote a MR program to process it.
> > Here are some of my settings in my mapred-site.xml.
> > ---------------------------------------------------------------
> > mapred.tasktracker.map.tasks.maximum = 20
> > mapred.tasktracker.reduce.tasks.maximum = 20
> > ---------------------------------------------------------------
> >
>
> These two configurations essentially tell the tasktrackers that they can
> run 20 maps and 20 reduces in parallel on a machine. Is this what you
> intended ? (Generally the sum of these two values should equal the number
> of cores on your tasktracker node, or a little more).
>
> Also, would help if you can tell us your cluster size - i.e. number of
> slaves.
Cluster size (No of slaves) = 4
Yes, I meant that the maximum number of tasks that can run on one machine is 20, for both map & reduce.
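To relate these settings to the rule of thumb quoted above (slots per node roughly equal to the number of cores), a small sketch can compare configured slots with cores. The helper name `slot_oversubscription` is made up for illustration; the inputs are the values given in this thread (20 map slots, 20 reduce slots, 4 cores per TaskTracker).

```python
def slot_oversubscription(map_slots: int, reduce_slots: int, cores: int) -> float:
    """Return configured task slots per CPU core on one TaskTracker node.

    A value near 1 (or a little more) matches the rule of thumb quoted
    above; a much larger value means many tasks may compete for each core.
    """
    if cores <= 0:
        raise ValueError("cores must be > 0")
    if map_slots < 0 or reduce_slots < 0:
        raise ValueError("slot counts must be >= 0")
    return (map_slots + reduce_slots) / cores


if __name__ == "__main__":
    # Values from this thread: 20 map slots + 20 reduce slots, 4 cores.
    print(slot_oversubscription(20, 20, 4))  # 10.0
```

With the settings in this thread, a node could be asked to run up to ten tasks per core if all slots were in use, which is well above the rule of thumb.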
> > My block size is default, 64MB
> > With my data size = 518MB, i guess setting the maximum for MR task to 20
> > is far more than enough (518/64 = 8) , did i get it correctly?
> >
> >
> I suppose what you want is to run all the maps in parallel. For that, the
> number of map slots in your cluster should be more than the number of maps
> of your job (assuming there's a single job running). If the number of slots
> is less than number of maps, the maps would be scheduled in multiple waves.
> On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
> you the total slots available in your cluster.
My Map Task Capacity = 80%

So, from the explanation and from my data size and configuration:
Data size = 518MB
Number of map tasks required = 518/64 = 8 tasks

These 8 tasks should be spread among 4 slaves, which means each node should be able to handle at least 2 tasks. And my setting was mapred.tasktracker.map.tasks.maximum = 20, which is more than enough, so does that mean the approach is correct?
(Well, I have CPUx4 in my machine, so in case of large data, I should divide it by 4 in order to determine the smallest figure for mapred.tasktracker.map.tasks.maximum)
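The 518/64 = 8 estimate above can be checked with a short sketch. It assumes a single splittable input file of exactly 518MB, a split size equal to the 64MB block size, and the 1.1 "slop" factor that Hadoop's FileInputFormat applies so that a small remainder is folded into the last split. The function name `count_splits` is hypothetical, and the real number of maps depends on the actual files and input format.

```python
SPLIT_SLOP = 1.1  # assumption: mirrors the slop factor used by FileInputFormat
MB = 1024 * 1024


def count_splits(file_size: int, split_size: int) -> int:
    """Estimate the number of input splits (map tasks) for one file."""
    if file_size < 0 or split_size <= 0:
        raise ValueError("file_size must be >= 0 and split_size > 0")
    splits = 0
    remaining = file_size
    # Cut full-size splits while the remainder is more than 10% over one split.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left (up to 1.1 x split_size) becomes the final split.
    if remaining > 0:
        splits += 1
    return splits


if __name__ == "__main__":
    print(count_splits(518 * MB, 64 * MB))  # 8
```

Under these assumptions the last split covers the remaining 70MB, which is why the job shows 8 maps rather than 9.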
> > When i run the MR program, i could see in the Map/Reduce Administration
> > page that the number of Maps Total = 8, so i assume that everything is
> > going well here, once again if i'm wrong please correct me.
> > (Sometimes it shows only Maps Total = 3)
> >
> This value tells us the number of maps that will run for the job.
OK

> > There's one thing which i'm uncertain about hadoop distribution.
> > Is the Maps Total = 8 means that there are 8 map tasks split among all
> > the data nodes (task trackers)?
> > Is there anyway i can checked whether all the tasks are shared among
> > datanodes (where task trackers are working).
> >
> There's no easy way to check this. The task page for every task shows the
> attempts that ran for each task and where they ran under the 'Machine'
> column.
>
Thank you, i see that they're processed on different "Machine", so i guess it's working correctly :)
>
> > When i clicked on each link under that Task Id, i can see there's "Input
> > Split Locations" stated under each task details, if the inputs are
> > splitted between data nodes, does that means that everything is working
> > well?
> >
> >
> I think this is just the location of the splits, including the replicas.
> What you could see is if enough data local maps ran - which means that the
> tasks mostly got their inputs from datanodes running on the same machine as
> themselves. This is given by the counter "Data-local map tasks" on the job
> UI page.
>

There are two cases under the Job UI.

Counter                 Map   Reduce   Total
-----------------------------------------
Case (1)
Launched map tasks        0        0       4
Data-local map tasks      0        0       4

Case (2)
Launched map tasks        0        0       2
Data-local map tasks      0        0       1
Hmm, I don't quite understand this. Does case (2) mean that two map tasks are actually reading data from the same datanode?
But anyway, is this monitoring needed for tuning performance? Thank you.
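Going by the description of the "Data-local map tasks" counter quoted above, the two cases can be compared as a simple ratio of data-local maps to launched maps. This is only a sketch: the helper name `data_local_fraction` is made up, and the inputs are the Total column values from the two cases.

```python
def data_local_fraction(launched_maps: int, data_local_maps: int) -> float:
    """Return the share of launched map tasks that ran data-local.

    A data-local map reads its input from a datanode on the same machine;
    the rest fetch their input from another machine.
    """
    if launched_maps <= 0:
        raise ValueError("launched_maps must be > 0")
    if not 0 <= data_local_maps <= launched_maps:
        raise ValueError("data_local_maps must be between 0 and launched_maps")
    return data_local_maps / launched_maps


if __name__ == "__main__":
    print(data_local_fraction(4, 4))  # Case (1): 1.0
    print(data_local_fraction(2, 1))  # Case (2): 0.5
```

Read this way, case (2) would suggest that one of the two launched maps got its input from a datanode on a different machine, not that both maps read from the same datanode.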