-Re: On the topic of task scheduling
Vasco Visser 2012-09-04, 13:59
On Tue, Sep 4, 2012 at 3:11 PM, Robert Evans <[EMAIL PROTECTED]> wrote:
> The other thing to point out too is that in order to solve this problem
> perfectly you litterly have to solve the halting problem. You have to
> predict if the maps are going to finish quickly or slowly. If they finish
> quickly then you want to launch reduces quickly to start fetching data
> from the mappers, if they are going to finish very slowly, then you have a
> lot of reducers taking up resources not doing anything.
I agree with you that a perfect solution is not going to be feasible.
The aim should probably be a solution that is good in many cases.
> That is why there
> is the config parameter that can be set on a per job basis to tell the AM
> when to start launch maps.
I assume you mean start launching reducers
> We have actually been experimenting with
> setting this to 100% because it improves utilization of the cluster a lot.
thanks for pointing this out, I didn't know about this config option.
That the utilization of the cluster improves by setting this to 1
doesn't surprise me.
Maybe it is a good idea to introduce a concept like "job container
time" that captures how much resources a job uses in its life time.
For example, if a job uses 10 mappers each for a minute and 10
reducers also each for a minute, then the container time would be 20
minutes. Having idle reducer will increase container time.
A conceptually simple method to optimize the container time of a job
is to let the AM monitor for each scheduled reducer how much of the
time it is waiting for mappers to produce intermediate data (maybe
embed this in the heartbeat?). If the average waiting for all
scheduled reducers is above a certain proportion (say waiting more
than 25% of the time or smt), then the AM can decide to discard
some/all reducers and give the freed resources to mappers.
This is just an idea, I don't know about the feasibility. Also I
didn't think about the relationship between optimizing container time
for a single job and optimizing it for all jobs utilizing on the
cluster. Might be that minimizing for each job gives minimal overall,
but not sure.
> On 9/2/12 1:46 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
>> Welcome to Hadoop!
>> You observations are all correct - in simplest case you launch all
>>reduces up front (we used to do that initially) and get a good 'pipeline'
>>between maps, shuffle (i.e. moving map-outputs to reduces) and the reduce
>> However, one thing to remember is that keeping reduces up and running
>>without sufficient maps being completed is a waste of resources in the
>>cluster. As a result, we have a simple heuristic in hadoop-1 i.e. do not
>>launch reduces until a certain percentage of the job's maps are complete
>>- by default it's set to 5%. However, there still is a flaw with it
>>(regardless of what you set it to be i.e. 5% or 50%). If it's too high,
>>you lose the 'pipeline' and too low (5%), reduces still spin waiting for
>>all maps to complete wasting resources in the cluster.
>> Given that, we've implemented the heuristic you've described below for
>>hadoop-2 which is better at balancing resource-utilization v/s pipelining
>>or job latency.
>> However, as you've pointed out there are several improvements which are
>>feasible. But, remember that the complexity involved has on a number of
>>factors you've already mentioned:
>> # Job size (a job with 100m/10r v/s 100000m/10000r)
>> # Skew for reduces
>> # Resource availability i.e. other active jobs/shuffles in the system,
>>network bandwidth etc.
>> If you look at an ideal shuffle it will look so (pardon my primitive
>> From that graph:
>> # X i.e. when to launch reduces depends on resource availability, job
>>size & maps' completion rate.
>> # Slope of shuffles (red worm) depends on network b/w, skew etc.
>> None of your points are invalid - I'm just pointing out the