|
|
-
Re: On the topic of task schedulingArun C Murthy 2012-09-02, 18:46
Vasco,
Welcome to Hadoop! You observations are all correct - in simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce itself. However, one thing to remember is that keeping reduces up and running without sufficient maps being completed is a waste of resources in the cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there still is a flaw with it (regardless of what you set it to be i.e. 5% or 50%). If it's too high, you lose the 'pipeline' and too low (5%), reduces still spin waiting for all maps to complete wasting resources in the cluster. Given that, we've implemented the heuristic you've described below for hadoop-2 which is better at balancing resource-utilization v/s pipelining or job latency. However, as you've pointed out there are several improvements which are feasible. But, remember that the complexity involved has on a number of factors you've already mentioned: # Job size (a job with 100m/10r v/s 100000m/10000r) # Skew for reduces # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc. If you look at an ideal shuffle it will look so (pardon my primitive scribble): http://people.apache.org/~acmurthy/ideal-shuffle.png From that graph: # X i.e. when to launch reduces depends on resource availability, job size & maps' completion rate. # Slope of shuffles (red worm) depends on network b/w, skew etc. None of your points are invalid - I'm just pointing out the possibilities and complexities. Your points about aggregation are also valid, look at http://code.google.com/p/sailfish/ for e.g. One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement your own - I'd love to help if you are interested in playing with them. Related jiras: https://issues.apache.org/jira/browse/MAPREDUCE-4584 hth, Arun On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote: > Hi, > > I am new to the list, I am working with hadoop in the context of my > MSc graduation project (has nothing to do with task scheduling per > se). I came across task scheduling because I ran into the fifo > starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where > the fifo starvation issue is solved. The behavior of task scheduling I > observe in this branch is as follows. It begins with all containers > allocated to mappers. Pretty quickly reducers are starting to be > scheduled. In a linear way more containers are given to reducers, > until about 50% (does anybody know why 50%?) of available containers > are reducers (this point is reached when ~ 50% of the mappers are > finished). It stays ~50-50 for until all mappers are scheduled. Only > then the proportion of containers allocated to reducers is increased > to > 50%. > > I don't think this is in general quite the optimal (in terms of total > job completion time) scheduling behavior. The reason being that the > last reducer can only be scheduled when a free container becomes > available after all mappers are scheduled. Thus, in order to shorten > total job completion time the last reducer must be scheduled as early > as possible. > > For the following gedankenexperiment, assume # reducer is set to 99% > capacity, as suggested somewhere in the hadoop docs, and that each > reducer will process roughly the same amount of work. I am going to > schedule as in 2.1.0, but instead of allocating reducers slowly up to > 50 % of capacity, I am just going to take away containers. Thus, the > amount of map work is the same as in 2.1.0, only no reduce work will > be done. At the point that the proportion of reducers would increased > to more than 50% of the containers (i.e., near the end of the map > phase), I schedule all reducers in the containers I took away, making Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ |