Re: On the topic of task scheduling
Vasco,

 Welcome to Hadoop!

 Your observations are all correct - in the simplest case you launch all reduces up front (we used to do that initially) and get a good 'pipeline' between maps, shuffle (i.e. moving map-outputs to reduces) and the reduces themselves.

 However, one thing to remember is that keeping reduces up and running without sufficient completed maps is a waste of resources in the cluster. As a result, we have a simple heuristic in hadoop-1: do not launch reduces until a certain percentage of the job's maps are complete - by default it's set to 5%. However, there is still a flaw with it, regardless of what you set it to. If the threshold is too high, you lose the 'pipeline'; if it's too low (e.g. the 5% default), reduces spin waiting for all maps to complete, wasting resources in the cluster.
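 For concreteness, here is a minimal sketch of tuning that threshold per-job via the hadoop-1 API (the property is mapred.reduce.slowstart.completed.maps; the 0.50 below is just an illustrative value, not a recommendation):

     import org.apache.hadoop.mapred.JobConf;

     public class SlowstartTuning {
       public static void main(String[] args) {
         JobConf conf = new JobConf();
         // Default is 0.05: launch reduces once 5% of maps have completed.
         // Raising it trades pipelining away for better cluster utilization.
         conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.50f);
         System.out.println(
             conf.getFloat("mapred.reduce.slowstart.completed.maps", 0.05f));
       }
     }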

 Given that, we've implemented the heuristic you've described below for hadoop-2, which does a better job of balancing resource-utilization v/s pipelining (i.e. job latency).
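 If you want to poke at the hadoop-2 knobs, a sketch follows - assuming I've got the property names right; double-check MRJobConfig to be sure:

     import org.apache.hadoop.conf.Configuration;

     public class ReduceRampupTuning {
       public static void main(String[] args) {
         Configuration conf = new Configuration();
         // The slowstart threshold, renamed in hadoop-2.
         conf.setFloat("mapreduce.job.reduce.slowstart.completedmaps", 0.05f);
         // Cap on the share of the job's containers the AM will hand to
         // reduces while maps are still pending - likely the ~50/50 split
         // you observed.
         conf.setFloat("yarn.app.mapreduce.am.job.reduce.rampup.limit", 0.50f);
         // Share of running reduces the AM may preempt if maps starve.
         conf.setFloat("yarn.app.mapreduce.am.job.reduce.preemption.limit", 0.50f);
       }
     }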

 However, as you've pointed out, there are several feasible improvements. But remember that the complexity involved depends on a number of factors you've already mentioned (see the toy sketch after this list):
 # Job size (a job with 100m/10r v/s 100000m/10000r)
 # Skew for reduces
 # Resource availability i.e. other active jobs/shuffles in the system, network bandwidth etc.
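
 As a toy illustration only - this is not the actual AM code, and every name in it is made up - the decision boils down to something like the following, with all the hard parts hidden in how you pick the inputs:

     // Toy sketch: how many reduce containers to request, given the
     // factors above. Illustrative only; none of this is a Hadoop API.
     public class ReduceLaunchHeuristic {
       static int reducesToRequest(int totalMaps, int completedMaps,
                                   int totalReduces, int runningReduces,
                                   int totalContainers,
                                   float slowstart, float rampupLimit) {
         float mapProgress = (float) completedMaps / totalMaps;
         if (mapProgress < slowstart) {
           return 0;                        // too little map output to shuffle yet
         }
         if (completedMaps == totalMaps) {  // maps done: reduces can take it all
           return totalReduces - runningReduces;
         }
         // While maps remain, cap reduces at a fixed share of the cluster.
         int cap = (int) (rampupLimit * totalContainers);
         return Math.max(0, Math.min(cap, totalReduces) - runningReduces);
       }

       public static void main(String[] args) {
         // 100 maps (40 done), 15 reduces wanted, none running yet,
         // 20 containers, 5% slowstart, 50% ramp-up cap -> prints 10,
         // i.e. reduces are capped at half the cluster.
         System.out.println(reducesToRequest(100, 40, 15, 0, 20, 0.05f, 0.5f));
       }
     }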

 If you look at an ideal shuffle it will look like this (pardon my primitive scribble):
 http://people.apache.org/~acmurthy/ideal-shuffle.png

 From that graph:
 # X i.e. when to launch reduces depends on resource availability, job size & maps' completion rate.
 # Slope of shuffles (red worm) depends on network b/w, skew etc.

 None of your points are invalid - I'm just pointing out the possibilities and complexities.

 Your points about aggregation are also valid - see http://code.google.com/p/sailfish/ for an example.

 One of the advantages of hadoop-2 is that anyone can play with these heuristics and implement their own - I'd love to help if you are interested in playing with them.

 Related jiras:
 https://issues.apache.org/jira/browse/MAPREDUCE-4584

hth,
Arun

On Sep 2, 2012, at 9:34 AM, Vasco Visser wrote:

> Hi,
>
> I am new to the list, I am working with hadoop in the context of my
> MSc graduation project (has nothing to do with task scheduling per
> se). I came across task scheduling because I ran into the fifo
> starvation bug (MAPREDUCE-4613). Now I am running the 2.1.0 branch,
> where the fifo starvation issue is solved. The task scheduling behavior I
> observe in this branch is as follows. It begins with all containers
> allocated to mappers. Pretty quickly reducers start to be scheduled,
> and containers are given over to reducers at a linear rate,
> until about 50% (does anybody know why 50%?) of available containers
> are reducers (this point is reached when ~ 50% of the mappers are
> finished). It stays ~50-50 until all mappers are scheduled. Only
> then is the proportion of containers allocated to reducers increased
> to > 50%.
>
> I don't think this is, in general, optimal scheduling behavior (in
> terms of total job completion time). The reason is that the
> last reducer can only be scheduled when a free container becomes
> available after all mappers are scheduled. Thus, in order to shorten
> total job completion time the last reducer must be scheduled as early
> as possible.
>
> For the following gedankenexperiment, assume the number of reducers is
> set to 99% of capacity, as suggested somewhere in the hadoop docs, and
> that each
> reducer will process roughly the same amount of work. I am going to
> schedule as in 2.1.0, but instead of allocating reducers slowly up to
> 50% of capacity, I am just going to take away containers. Thus, the
> amount of map work is the same as in 2.1.0, only no reduce work will
> be done. At the point where the proportion of reducers would increase
> to more than 50% of the containers (i.e., near the end of the map
> phase), I schedule all reducers in the containers I took away, making

Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/