Assignment of data splits to mappers
When MR assigns data splits to map tasks, does it ever assign a set of non-contiguous blocks to one map?  The reason I ask is that, thinking through the problem, if I were the MR scheduler I would try to hand each map task a bunch of blocks that all live on the same datanode, and then schedule that map task on that node.  For example, if I have an HDFS file with 10,000 blocks and I want to create 1,000 map tasks, I'd like each map task to get 10 blocks, but those 10 blocks are unlikely to be contiguous on a given datanode.
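
For reference, here is a rough sketch of my mental model of the default FileInputFormat-style split computation (not the actual Hadoop source; the class name SplitSketch is just something I made up for illustration): one split per block-sized chunk, with that block's replica hosts attached so the scheduler can try to place the map task on one of those nodes.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitSketch {
  // Build one split per block of the file, carrying that block's replica
  // hosts as locality hints for the scheduler.
  static List<InputSplit> splitsForFile(FileSystem fs, Path file) throws IOException {
    FileStatus status = fs.getFileStatus(file);
    long blockSize = status.getBlockSize();
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (long offset = 0; offset < status.getLen(); offset += blockSize) {
      long length = Math.min(blockSize, status.getLen() - offset);
      // Find the block covering this offset and take its replica hosts.
      String[] hosts = new String[0];
      for (BlockLocation b : blocks) {
        if (offset >= b.getOffset() && offset < b.getOffset() + b.getLength()) {
          hosts = b.getHosts();
          break;
        }
      }
      splits.add(new FileSplit(file, offset, length, hosts));
    }
    return splits;
  }
}

What I'm asking is whether the framework ever does the opposite: packs several blocks that happen to share a datanode into a single split, even though they aren't contiguous in the file.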

This is related to a question I asked earlier: whether any benefit could be had by aligning data splits with block boundaries, to avoid a read slopping over from one block into the next and requiring a connection to another datanode.  The answer I got was that the extra connection overhead wasn't important.  The reason I bring this up again is that comments in this discussion (https://issues.apache.org/jira/browse/HADOOP-3315) imply that doing an extra seek to the beginning of the file to read a magic number on open is a significant overhead, and that looks like a similar issue to me.
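
To make the alignment idea concrete, this is the kind of thing I have in mind (a sketch using the newer org.apache.hadoop.mapreduce API; the class name and input path are just placeholders): cap the split size at the file's block size so every split starts on a block boundary.  Even then, as I understand it, a record that straddles the boundary forces the reader a little past the split end, which is the extra-connection case I'm asking about.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class AlignedSplitsDriver {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path input = new Path(args[0]);  // placeholder input path
    long blockSize = fs.getFileStatus(input).getBlockSize();

    Job job = Job.getInstance(conf, "aligned-splits-sketch");
    job.setInputFormatClass(TextInputFormat.class);
    FileInputFormat.addInputPath(job, input);
    // Cap each split at one block so split boundaries coincide with block boundaries.
    FileInputFormat.setMaxInputSplitSize(job, blockSize);
  }
}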

Thanks,
john
