Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
MapReduce >> mail # user >> Non data-local scheduling


Copy link to this message
-
Re: Non data-local scheduling
Hi Andre,

Try setting yarn.scheduler.capacity.node-locality-delay to a number between
0 and 1.  This will turn on delay scheduling - here's the doc on how this
works:

For applications that request containers on particular nodes, the number of
scheduling opportunities since the last container assignment to wait before
accepting a placement on another node. Expressed as a float between 0 and
1, which, as a fraction of the cluster size, is the number of scheduling
opportunities to pass up. The default value of -1.0 means don't pass up any
scheduling opportunities.

-Sandy
On Thu, Oct 3, 2013 at 9:57 AM, André Hacker <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I have a 25 node cluster, running hadoop 2.1.0-beta, with capacity
> scheduler (default settings for scheduler) and replication factor 3.
>
> I have exclusive access to the cluster to run a benchmark job and I wonder
> why there are so few data-local and so many rack-local maps.
>
> The input format calculates 44 input splits and 44 map tasks, however, it
> seems to be random how many of them are processed data locally. Here the
> counters of my last tries:
>
> data-local / rack-local:
> Test 1: data-local:15 rack-local: 29
> Test 2: data-local:18 rack-local: 26
>
> I don't understand why there is not always 100% data local. This should
> not be a problem since the blocks of my input file are distributed over all
> nodes.
>
> Maybe someone can give me a hint.
>
> Thanks,
> André Hacker, TU Berlin
>