Try setting yarn.scheduler.capacity.node-locality-delay to a number between
0 and 1. This will turn on delay scheduling. Here's the doc on how this works:
For applications that request containers on particular nodes, the number of
scheduling opportunities since the last container assignment to wait before
accepting a placement on another node. Expressed as a float between 0 and
1, which, as a fraction of the cluster size, is the number of scheduling
opportunities to pass up. The default value of -1.0 means don't pass up any
scheduling opportunities.
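
To make the "fraction of the cluster size" wording concrete, here is a rough
sketch of the arithmetic for a 25-node cluster like yours. The function name
is made up for illustration, and it only follows the description quoted
above: it is not the scheduler's actual code, and how the scheduler rounds a
fractional result is not covered. It is also worth checking the scheduler
docs for your version on whether the value is expected as a fraction or as
an absolute count.

```python
def opportunities_to_pass_up(locality_threshold: float, cluster_size: int) -> float:
    """Scheduling opportunities to skip before accepting a non-local node.

    Hypothetical helper: follows the quoted description, where the threshold
    is a fraction of the cluster size and a negative value (the -1.0 default)
    means no opportunities are passed up.
    """
    if locality_threshold < 0:
        return 0.0
    return locality_threshold * cluster_size


if __name__ == "__main__":
    nodes = 25  # cluster size from the original question
    for threshold in (-1.0, 0.25, 0.5, 1.0):
        skipped = opportunities_to_pass_up(threshold, nodes)
        print(f"threshold={threshold}: pass up {skipped} opportunities")
```

Going by that description, a value closer to 1 makes the scheduler wait longer
for a node-local slot, which should raise the data-local count at the cost of
some scheduling latency.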
On Thu, Oct 3, 2013 at 9:57 AM, André Hacker <[EMAIL PROTECTED]> wrote:
> I have a 25 node cluster, running hadoop 2.1.0-beta, with capacity
> scheduler (default settings for scheduler) and replication factor 3.
> I have exclusive access to the cluster to run a benchmark job and I wonder
> why there are so few data-local and so many rack-local maps.
> The input format calculates 44 input splits and 44 map tasks; however, it
> seems to be random how many of them are processed data-locally. Here are the
> counters from my last runs:
> data-local / rack-local:
> Test 1: data-local:15 rack-local: 29
> Test 2: data-local:18 rack-local: 26
> I don't understand why the maps are not always 100% data-local. This should
> not be a problem, since the blocks of my input file are distributed over all
> nodes.
> Maybe someone can give me a hint.
> André Hacker, TU Berlin