Welcome to Hadoop!
Your observations are all correct - in the simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between the maps, the shuffle (i.e. moving map-outputs to the reduces) and the reduce itself.
However, one thing to remember is that keeping reduces up and running before enough maps have completed is a waste of cluster resources. As a result, we have a simple heuristic in hadoop-1, i.e. do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there is still a flaw with it, regardless of what you set it to (5% or 50%): if the threshold is too high you lose the 'pipeline', and if it's too low (e.g. 5%) the reduces sit idle waiting for all the maps to complete, wasting resources in the cluster.
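For what it's worth, that threshold is just a per-job knob. A minimal sketch of bumping it from a driver, assuming the Hadoop 2.x property name mapreduce.job.reduce.slowstart.completedmaps (the hadoop-1 equivalent is mapred.reduce.slowstart.completed.maps); the job name and the 0.80 value are only illustrative:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class SlowStartDemo {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Fraction of maps that must complete before reduces are launched.
      // The default 0.05 starts reduces after 5% of maps finish; 0.80 delays
      // them until 80% of maps are done, trading pipelining for fewer idle reduces.
      conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.80f);

      Job job = Job.getInstance(conf, "slowstart-demo");
      // ... set mapper/reducer classes and input/output paths as usual,
      // then job.waitForCompletion(true)
    }
  }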
Given that, we've implemented the heuristic you've described below for hadoop-2, which does a better job of balancing resource-utilization v/s pipelining (i.e. job latency).
However, as you've pointed out, there are several improvements which are feasible. But remember that the complexity involved depends on a number of factors you've already mentioned:
# Job size (a job with 100m/10r v/s 100000m/10000r)
# Skew for reduces
# Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc.
If you look at an ideal shuffle it would look something like this (pardon my primitive scribble):
From that graph:
# X, i.e. when to launch reduces, depends on resource availability, job size & the maps' completion rate.
# The slope of the shuffles (the red worm) depends on network b/w, skew etc.
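To make that concrete, here is a deliberately simplified sketch of a ramp-up heuristic along those lines (in hadoop-2 the real logic lives in the MR ApplicationMaster's container allocator, and it also has to deal with headroom and with preempting reduces if maps starve); the method, names and constants below are illustrative, not the actual implementation:

  // Illustrative only: pick how many reduces to have scheduled, given map
  // progress, a slow-start threshold and how many containers the job can use.
  public final class ReduceRampUpSketch {
    static int reducesToSchedule(int totalMaps, int completedMaps,
                                 float slowStart,        // e.g. 0.05
                                 float reduceShareCap,   // e.g. 0.50 of the job's containers
                                 int availableContainers,
                                 int totalReduces) {
      float mapProgress = totalMaps == 0 ? 1.0f : (float) completedMaps / totalMaps;
      if (mapProgress < slowStart) {
        return 0;                       // too early: nothing useful to shuffle yet
      }
      if (completedMaps == totalMaps) {
        return totalReduces;            // maps done: reduces can take everything
      }
      // Ramp up roughly linearly with map progress ('X' and the slope above),
      // but cap the reduce share so running maps are never squeezed out.
      int cap = (int) (reduceShareCap * availableContainers);
      int desired = (int) (mapProgress * totalReduces);
      return Math.min(totalReduces, Math.min(desired, cap));
    }
  }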
None of your points are invalid - I'm just pointing out the possibilities and complexities.
Your points about aggregation are also valid - look at http://code.google.com/p/sailfish/ for example.
One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement their own - I'd love to help if you are interested in playing with them.
On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:
> I am new to the list, I am working with hadoop in the context of my
> MSc graduation project (has nothing to do with task scheduling per
> se). I came across task scheduling because I ran into the fifo
> starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
> the fifo starvation issue is solved. The behavior of task scheduling I
> observe in this branch is as follows. It begins with all containers
> allocated to mappers. Pretty quickly reducers are starting to be
> scheduled. In a linear way more containers are given to reducers,
> until about 50% (does anybody know why 50%?) of available containers
> are reducers (this point is reached when ~ 50% of the mappers are
> finished). It stays ~50-50 until all mappers are scheduled. Only
> then the proportion of containers allocated to reducers is increased
> to > 50%.
> I don't think this is in general quite the optimal (in terms of total
> job completion time) scheduling behavior. The reason being that the
> last reducer can only be scheduled when a free container becomes
> available after all mappers are scheduled. Thus, in order to shorten
> total job completion time the last reducer must be scheduled as early
> as possible.
> For the following gedankenexperiment, assume # reducer is set to 99%
> capacity, as suggested somewhere in the hadoop docs, and that each
> reducer will process roughly the same amount of work. I am going to
> schedule as in 2.1.0, but instead of allocating reducers slowly up to
> 50 % of capacity, I am just going to take away containers. Thus, the
> amount of map work is the same as in 2.1.0, only no reduce work will
> be done. At the point that the proportion of reducers would increase
> to more than 50% of the containers (i.e., near the end of the map
> phase), I schedule all reducers in the containers I took away, making
Arun C. Murthy