-On the topic of task scheduling
Vasco Visser 2012-09-02, 16:34
I am new to the list, I am working with hadoop in the context of my
MSc graduation project (has nothing to do with task scheduling per
se). I came across task scheduling because I ran into the fifo
starvation bug (MAPREDUCE-4613). Now, I am running 2.1.0 branch where
the fifo starvation issue is solved. The behavior of task scheduling I
observe in this branch is as follows. It begins with all containers
allocated to mappers. Pretty quickly reducers are starting to be
scheduled. In a linear way more containers are given to reducers,
until about 50% (does anybody know why 50%?) of available containers
are reducers (this point is reached when ~ 50% of the mappers are
finished). It stays ~50-50 for until all mappers are scheduled. Only
then the proportion of containers allocated to reducers is increased
to > 50%.
I don't think this is in general quite the optimal (in terms of total
job completion time) scheduling behavior. The reason being that the
last reducer can only be scheduled when a free container becomes
available after all mappers are scheduled. Thus, in order to shorten
total job completion time the last reducer must be scheduled as early
For the following gedankenexperiment, assume # reducer is set to 99%
capacity, as suggested somewhere in the hadoop docs, and that each
reducer will process roughly the same amount of work. I am going to
schedule as in 2.1.0, but instead of allocating reducers slowly up to
50 % of capacity, I am just going to take away containers. Thus, the
amount of map work is the same as in 2.1.0, only no reduce work will
be done. At the point that the proportion of reducers would increased
to more than 50% of the containers (i.e., near the end of the map
phase), I schedule all reducers in the containers I took away, making
sure that the last reducer is scheduled at the same moment as it would
be in 2.1.0. My claim is that the job completion time of this
hypothetical scheduling is about the same as the scheduling in 2.1.0
(as the last reducer is scheduled at the same time), even though I
took away 50% of the available resources for a large part of the job!
The conclusion is that it would be better to allocate all available
containers to mappers, and that reducers are starting to be scheduled
when the map phase is nearing its end, instead of right at the
beginning of the job.
Scheduling reducers early seems to me the way to go only when: 1) the
output from mappers is very skewed, i.e., some reducers are expected
to need much more time than others, 2) the network connection between
nodes is (expected to be) a big bottleneck, i.e., schedule reducers
early to smear out data transfer over the lifetime of a job, or 3)
there is no contention for resource containers.
with regard to point 1: skewedness can be determined by looking at
relative sizes of partitioned mapper output.
with regard to point 2: I think the network is only a bottleneck if it
feeds tuples slower than the reducer can merge sort the tuples (am I
right?). Also, it might be a nice optimization to transfer the
intermediate data to the machine that is going/likely to run a
specific reducer before the reducer is actually ran there (e.g.,
something like a per machine prefetch manager?). A per machine task
scheduling queue would be needed for this, to determine where a
reducer is going/likely to be scheduled.
Just my two cents. I'm interested in hearing opinions on this matter.