Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - Re: Assignment of data splits to mappers


Copy link to this message
-
Re: Assignment of data splits to mappers
Bertrand Dechoux 2013-06-13, 21:37
The first question can be split (no pun intended) into two topics because
there is actually two distinct steps. First, the InputFormat partitions the
data source into InputSplits. Its implementation will determine the exact
logic. Then the scheduler is responsible for ordering where/when the
InputSplit should be processed. But it doesn't really deal with block
itself. The InputSplit itself knows on which node the data would be local
or not.

If there is no other choice, you (or more exactly the implementation) can
choose to have several blocks per InputSplit. But of course, it open lots
of issues. The default strategy is one block per InputSplit (and thus per
map task because there is one map task per InputSplit). If you really need
to put several blocks per InputSplit, the root cause might often be that
the block size is not big enough. I think it is fair to assume that the
10000 block file your are referring to is not using a 512MB block size.

MultiFileInputFormat does make InputSplit with blocks that are unlikely to
be on the same datanode. But that's a good decision in regard to the kind
of data source it has to deal with. Anyway, two 'continuous' blocks are
also very unlikely to be on the same datanode (and even less the same HDD,
and even less really continuous). The only abstraction to tell whether
record of data should be close one from the other is the block. That's why
the idea is not really to optimize read of 'continuous' blocks on the same
machine/HDD but to consider whether the block size is the right one.

HDFS and Hadoop MapReduce have been designed to work together but there is
a clean abstraction between them. HDFS does not know about records and
clients writing to HDFS (like MapReduce) do not often need to know the
block boundaries explicitly. That's why the RecordReader provided by the
InputSplit is responsible for interpreting the data into records. But of
course, it has to know how to deal with records stored on the block
boundary. It will happen. The advantage is that the record logic can not
corrupt the storage and can be selected at read time. TextInputFormat,
KeyValueTextInputFormat and NLineInputFormat have different strategies
which is only possible due to this abstraction. And that's also why
MapReduce can read/write to other kinds of 'datastorage', like HBase for
example : because it is not tightly coupled with HDFS. But it does also
bring drawbacks.

Regards

Bertrand

On Thu, Jun 13, 2013 at 7:57 PM, John Lilley <[EMAIL PROTECTED]>wrote:

>  When MR assigns data splits to map tasks, does it assign a set of
> non-contiguous blocks to one map?  The reason I ask is, thinking through
> the problem, if I were the MR scheduler I would attempt to hand a map task
> a bunch of blocks that all exist on the same datanode, and then schedule
> the map task on that node.  E.g. if I have an HDFS file with 10000 blocks
> and I want to create 1000 map tasks I’d like each map task to have 10
> blocks, but those blocks are unlikely to be contiguous on a given datanode.
> ****
>
> ** **
>
> This is related to a question I had asked earlier, which is whether any
> benefit could be had by aligning data splits along block boundaries to
> avoid slopping reads of a block to the next block and requiring another
> datanode connection.  The answer I got was that the extra connection
> overhead wasn’t important.  The reason I bring this up again is that
> comments in this discussion (
> https://issues.apache.org/jira/browse/HADOOP-3315) imply that doing an
> extra seek to the beginning of the file to read a magic number on open is a
> significant overhead, and this looks like a similar issue to me.****
>
> ** **
>
> Thanks,****
>
> john****
>
> ** **
>

--
Bertrand Dechoux