
Assignment of data splits to mappers
When MR assigns data splits to map tasks, does it ever assign a set of non-contiguous blocks to one map?  The reason I ask is that, thinking through the problem, if I were the MR scheduler I would try to hand a map task a bunch of blocks that all exist on the same datanode, and then schedule the map task on that node.  E.g., if I have an HDFS file with 10000 blocks and I want to create 1000 map tasks, I'd like each map task to have 10 blocks, but the blocks stored on any given datanode are unlikely to be contiguous in the file.
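To make the idea concrete, here is a rough sketch of the grouping I have in mind. It is only an illustration of the question, not a description of what the MR scheduler actually does. The function name, the dict of block locations, and the greedy strategy are all my own assumptions.

```python
from collections import defaultdict


def group_blocks_by_node(block_locations, blocks_per_task):
    """Greedily group blocks that share a datanode into map tasks.

    block_locations: dict mapping block id -> list of datanode names
                     holding a replica (hypothetical input format).
    Returns a list of (node, [block ids]) pairs; node is None for
    leftover groups that could not be filled from a single datanode.
    """
    # Index the blocks by every datanode that holds a replica.
    by_node = defaultdict(list)
    for block_id in sorted(block_locations):
        for node in block_locations[block_id]:
            by_node[node].append(block_id)

    assigned = set()
    tasks = []
    for node in sorted(by_node):
        # Skip blocks already claimed through a replica on another node.
        pending = [b for b in by_node[node] if b not in assigned]
        # Only emit full groups here; partial groups fall through below.
        while len(pending) >= blocks_per_task:
            group = pending[:blocks_per_task]
            pending = pending[blocks_per_task:]
            assigned.update(group)
            tasks.append((node, group))

    # Whatever is left gets grouped without regard to locality.
    leftovers = [b for b in sorted(block_locations) if b not in assigned]
    for i in range(0, len(leftovers), blocks_per_task):
        tasks.append((None, leftovers[i:i + blocks_per_task]))
    return tasks


if __name__ == "__main__":
    # Six blocks alternating between two datanodes, three blocks per task.
    locations = {0: ["a"], 1: ["b"], 2: ["a"], 3: ["b"], 4: ["a"], 5: ["b"]}
    for node, blocks in group_blocks_by_node(locations, 3):
        print(node, blocks)
```

In the small example at the bottom, each task ends up with blocks 0, 2, 4 or 1, 3, 5, i.e. node-local but not contiguous in the file, which is exactly the situation I am asking about.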

This is related to a question I asked earlier: whether there is any benefit to aligning data splits along block boundaries, so that a read does not spill over from one block into the next and require another datanode connection.  The answer I got was that the extra connection overhead wasn't important.  The reason I bring this up again is that comments in this discussion (https://issues.apache.org/jira/browse/HADOOP-3315) imply that doing an extra seek to the beginning of the file to read a magic number on open is a significant overhead, and this looks like a similar issue to me.
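For a sense of how often a split would straddle blocks, here is a small back-of-the-envelope sketch. It assumes splits are simply consecutive fixed-size byte ranges of the file and ignores record boundaries entirely. The function name and parameters are made up for illustration.

```python
def splits_crossing_blocks(file_size, block_size, split_size):
    """Count fixed-size byte-range splits that span more than one block.

    Assumes splits are consecutive ranges [start, end) of split_size
    bytes (the last one may be shorter). Record boundaries are ignored.
    """
    if block_size <= 0 or split_size <= 0:
        raise ValueError("block_size and split_size must be positive")
    crossing = 0
    start = 0
    while start < file_size:
        end = min(start + split_size, file_size)  # exclusive end offset
        first_block = start // block_size
        last_block = (end - 1) // block_size
        if last_block > first_block:
            crossing += 1
        start = end
    return crossing


if __name__ == "__main__":
    # Aligned: split size equals block size, so no split straddles a block.
    print(splits_crossing_blocks(1000, 100, 100))
    # Misaligned: split size is 1.5x the block size.
    print(splits_crossing_blocks(1000, 100, 150))
```

With aligned sizes no split crosses a block boundary, while with the misaligned sizes above most of them do, and each of those would need the extra datanode connection I am asking about.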