-
Understanding of the hadoop distribution system (tuning)
Elaine Gan 2012-09-11, 01:56
Hi,
I'm new to Hadoop and have just started playing around with MapReduce. I would like to check whether my understanding of Hadoop is correct, and I would appreciate it if anyone could correct me where I'm wrong.
I have around 518 MB of data, and I wrote an MR program to process it. Here are some of the settings in my mapred-site.xml:
---------------------------------------------------------------
mapred.tasktracker.map.tasks.maximum = 20
mapred.tasktracker.reduce.tasks.maximum = 20
---------------------------------------------------------------
My block size is the default, 64 MB. With a data size of 518 MB, I guess that setting the maximum number of MR tasks to 20 is far more than enough (518/64 ≈ 8). Did I get that right?
When I run the MR program, the Map/Reduce Administration page shows Maps Total = 8, so I assume that everything is going well here. Once again, if I'm wrong, please correct me. (Sometimes it shows only Maps Total = 3.)
There is one thing I'm uncertain about regarding Hadoop distribution. Does Maps Total = 8 mean that there are 8 map tasks split among all the data nodes (task trackers)? Is there any way I can check whether the tasks are shared among the datanodes (where the task trackers are running)? When I click on each link under Task Id, I can see "Input Split Locations" listed in the task details. If the inputs are split between data nodes, does that mean that everything is working well?
I need to make sure everything is running well, because my MR job took around 6 hours to finish even though the input is small. (I know Hadoop is not meant for small data.) I'm not sure whether my configuration is wrong or Hadoop is just not suitable for my case. I'm actually running a Mahout k-means analysis.
Thank you for your time.
-
Re: Understanding of the hadoop distribution system (tuning)
Jagat Singh 2012-09-11, 02:17
Hello Elaine,
You did not tell us your cluster size: the number of nodes and the number of cores in each node.
What sort of work are you doing? 6 hours for 518 MB of data is a huge amount of time.
The number of map tasks would be about 518/64, i.e. roughly 8.
So that many map tasks need to run to process your data.
They can run on a single node or on multiple nodes, depending on the available slots. Did you check the JobTracker page while the job was executing? There you can see on which node each task is being processed. You can go to the Running tasks page.
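The 518/64 estimate can be made a little more precise. The sketch below assumes the input is a single non-empty, splittable file and that the split size equals the 64 MB block size; count_splits is a hypothetical helper name, not a Hadoop API. It mimics the way FileInputFormat lets the last split run up to 10% over the split size (its SPLIT_SLOP constant of 1.1), which is why 518 MB gives 8 splits rather than 9.

```python
SPLIT_SLOP = 1.1  # FileInputFormat allows the final split to be up to 10% oversized

MB = 1024 * 1024


def count_splits(file_size, split_size):
    """Estimate the number of input splits for ONE non-empty splittable file.

    Hypothetical helper for illustration only; empty files and
    non-splittable (e.g. gzip) inputs are not modelled here.
    """
    splits = 0
    remaining = file_size
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left (at most 1.1 * split_size) becomes the last split.
    if remaining > 0:
        splits += 1
    return splits


print(count_splits(518 * MB, 64 * MB))  # 8
```

If the same 518 MB were spread over several files, each file would be split separately, so the total could differ from 8.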
Regards,
Jagat Singh
-
Re: Understanding of the hadoop distribution system (tuning)
Bejoy Ks 2012-09-11, 06:42
Hi Elaine
Slots (mapred.tasktracker.[map/reduce].tasks.maximum) are configured at the cluster/node/TaskTracker level, not at the job level. You configure them based on the resources available on each node. For this you need to consider the number of cores, not the number of CPUs. Say you have 4 quad-core processors: then you have 16 cores, and if they are hyper-threaded you can take the effective number of cores to be 1 to 1.5 times the actual number. When you specify slots you also need to consider memory: if a task JVM (mapred.child.java.opts) is configured with 2 GB and you have just 16 GB of memory at your disposal, then you can have only 8 slots (8 × 2 GB = 16 GB). If you configure more slots than that, it can lead to swapping and OOM issues when all slots are used in parallel.
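As a rough illustration of the sizing rule above, here is a minimal sketch that takes the smaller of a CPU-based limit and a memory-based limit. The function name and parameters are hypothetical, and it assumes the child JVM heap is the only memory cost per slot; in practice you would also leave headroom for the OS and the Hadoop daemons.

```python
def max_slots_per_node(physical_cores, memory_mb, child_heap_mb,
                       hyperthreaded=False, ht_factor=1.5):
    """Upper bound on total (map + reduce) slots for one TaskTracker node.

    Hypothetical helper: ht_factor follows the "1 to 1.5 times" rule of
    thumb for hyper-threaded cores; pick a value in that range.
    """
    effective_cores = physical_cores * (ht_factor if hyperthreaded else 1.0)
    by_cpu = int(effective_cores)          # slots the cores can keep busy
    by_memory = memory_mb // child_heap_mb  # slots that fit in memory
    return min(by_cpu, by_memory)


# 4 quad-core processors, 16 GB available, 2 GB per task JVM:
# memory is the limiting factor, so 8 slots in total.
print(max_slots_per_node(16, 16 * 1024, 2 * 1024))  # 8
```

The result is the combined budget for the node, to be divided between mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum.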
When map tasks are run, you get a good proportion of data-local map tasks if you have a good replication factor; the default of 3 is good.
Map tasks are scheduled on nodes by the JobTracker (JT) largely based on data locality and available slots, so you cannot assume that the map tasks will be uniformly distributed across the cluster. If you have 8 map slots on a TaskTracker (TT), and a MapReduce job has 8 map tasks whose data all resides on that same node, then all 8 map tasks can run on that one node.
A few responses inline:
Case (2)
Launched map tasks    0  0  2
Data-local map tasks  0  0  1
Hmm.. not quite understand this, if case (2) it means two map tasks are actually reading data from same datanode?
[Bejoy] Yes, 2 had the data on the same node where the map tasks were executed, and 1 had the task running on one node while pulling its data from another node.
But anyway, is this monitoring needed for tuning performance?
[Bejoy] If you see a low number of data-local map tasks, then you need to look into it seriously, as it can degrade performance considerably. With larger data volumes, a few non-data-local map tasks are common.
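To turn the two counters discussed above into a single number to watch, one option is the fraction of launched map tasks that were data-local. This is only a sketch: data_local_fraction is a hypothetical helper, and what counts as "too low" depends on the cluster and job, so no threshold is suggested here.

```python
def data_local_fraction(launched_maps, data_local_maps):
    """Fraction of launched map tasks that read their input from the local node.

    Hypothetical helper; the inputs are the "Launched map tasks" and
    "Data-local map tasks" totals from the job counters page.
    """
    if launched_maps <= 0:
        raise ValueError("launched_maps must be positive")
    if not 0 <= data_local_maps <= launched_maps:
        raise ValueError("data_local_maps must be between 0 and launched_maps")
    return data_local_maps / launched_maps


# Counters quoted for Case (2): 2 launched, 1 data-local.
print(data_local_fraction(2, 1))  # 0.5
```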
Regards,
Bejoy KS
On Tue, Sep 11, 2012 at 11:37 AM, Elaine Gan <[EMAIL PROTECTED]> wrote:
> Hi Hermanth
>
> Thank you for your detailed answered. Your answers helped me much in
> understanding, especially on the Job UI.
>
> Sorry, i missed out my specs.
> NameNode (JobTracker) : CPUx4
> DataNode (TaskTracker) : CPUx4
>
> I am replying inline too.
>
> > > I have a data of around 518MB, and i wrote a MR program to process it.
> > > Here are some of my settings in my mapred-site.xml.
> > > ---------------------------------------------------------------
> > > mapred.tasktracker.map.tasks.maximum = 20
> > > mapred.tasktracker.reduce.tasks.maximum = 20
> > > ---------------------------------------------------------------
> > >
> >
> > These two configurations essentially tell the tasktrackers that they can
> > run 20 maps and 20 reduces in parallel on a machine. Is this what you
> > intended ? (Generally the sum of these two values should equal the number
> > of cores on your tasktracker node, or a little more).
> >
> > Also, would help if you can tell us your cluster size - i.e. number of
> > slaves.
>
> Cluster size (No of slaves) = 4
>
> Yes, i meant the maximum tasks that could be run in A machine is 20
> tasks, both map & reduce.
>
> > > My block size is default, 64MB
> > > With my data size = 518MB, i guess setting the maximum for MR task to 20
> > > is far more than enough (518/64 = 8) , did i get it correctly?
> > >
> >
> > I suppose what you want is to run all the maps in parallel. For that, the
> > number of map slots in your cluster should be more than the number of maps
> > of your job (assuming there's a single job running). If the number of slots
> > is less than number of maps, the maps would be scheduled in multiple waves.
> > On your jobtracker main page, the Cluster Summary > Map Task Capacity gives
> > you the total slots available in your cluster.
>
> My Map Task Capacity = 80%
> So, from the explanation and from my data size and configuration,
> Data size = 518MB
> Number of map tasks required = 518/64 = 8 tasks
> This 8 tasks should be spread among 4 slaves, which means each nodes