Re: Non data-local scheduling
Try experimenting with the block size relative to the split size. If
the blocks are very large and the splits are small, then multiple
splits correspond to the same block, and if there are more splits per
block than replicas of that block, you get rack-local processing.
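
To make that reasoning concrete, here is a minimal sketch in Python. It
mirrors the split-size rule used by Hadoop's FileInputFormat,
max(minSize, min(maxSize, blockSize)), and counts how many splits land
on one block. The 512 MB block size and 64 MB maximum split size are
made-up values for illustration, not the settings of the cluster in
question, and comparing splits per block against the replication
factor is a simplification: a node can run several containers, so
treat this as a rough indicator only.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    """Split-size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def splits_per_block(block_size, split_size):
    """Number of splits whose data lives in a single block."""
    return math.ceil(block_size / split_size)


# Hypothetical values, chosen only to illustrate the effect.
block_size = 512 * MB      # HDFS block size
min_split = 1              # minimum split size in bytes
max_split = 64 * MB        # maximum split size
replication = 3            # replication factor from the original post

split_size = compute_split_size(block_size, min_split, max_split)
per_block = splits_per_block(block_size, split_size)

print(f"split size: {split_size // MB} MB")
print(f"splits per block: {per_block}")
if per_block > replication:
    # Only `replication` nodes hold this block, so the remaining
    # splits compete for containers on those same few nodes.
    print("more splits per block than replicas: expect rack-local maps")
```

With these numbers, eight splits share one block that exists on only
three nodes, so some of the eight maps are likely to run elsewhere.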

On 10/3/2013 12:57 PM, André Hacker wrote:
> Hi,
>
> I have a 25 node cluster, running hadoop 2.1.0-beta, with capacity
> scheduler (default settings for scheduler) and replication factor 3.
>
> I have exclusive access to the cluster to run a benchmark job and I
> wonder why there are so few data-local and so many rack-local maps.
>
> The input format calculates 44 input splits and 44 map tasks; however,
> the number of them that are processed data-locally seems to be random.
> Here are the counters from my last two runs:
>
> data-local / rack-local:
> Test 1: data-local: 15, rack-local: 29
> Test 2: data-local: 18, rack-local: 26
>
> I don't understand why the maps are not always 100% data-local. This
> should not be a problem since the blocks of my input file are
> distributed over all nodes.
>
> Maybe someone can give me a hint.
>
> Thanks,
> André Hacker, TU Berlin
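
For reference, the counters quoted above can be turned into data-local
fractions with a few lines of Python. This is only arithmetic on the
numbers from the original message; the function name is made up for
this example.

```python
def data_local_fraction(data_local, rack_local):
    """Share of map tasks that ran on a node holding their input."""
    return data_local / (data_local + rack_local)


# Counters as reported in the original message.
runs = {"Test 1": (15, 29), "Test 2": (18, 26)}

for name, (data_local, rack_local) in runs.items():
    total = data_local + rack_local
    share = data_local_fraction(data_local, rack_local)
    print(f"{name}: {share:.1%} data-local of {total} maps")
```

Both runs account for all 44 map tasks, and in both of them well under
half of the maps were data-local.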