-RE: About block splitting, input split and TextInputFormat in MapReduce
Bikas Saha 2013-10-17, 17:34
The overall answer is that InputFormat implementations determine how to
split their data across block boundaries and then handle the read in order
to not have incomplete records.
When splits are generated then they typically don’t have block information.
They have an offset and length into the file and so the reader reads from
the offset to offset+length. Thus, if the offset and length are correctly
generated then the last record will not be incomplete since the reader will
read from the next block if needed. The next block may not be local to the
reader and so this may be a performance issue. Its up to the InputFormat
and readers to resolve that.
*From:* Yoonmin Nam [mailto:[EMAIL PROTECTED]]
*Sent:* Thursday, October 17, 2013 8:17 AM
*To:* [EMAIL PROTECTED]
*Subject:* About block splitting, input split and TextInputFormat in
Let we consider this situation:
1. Block size = 67108864 (64MB)
2. Data size = 2.2GB… (larger than block size)
Then, when I put the input into HDFS, I got the below list of block
Then, I checked each HDFS block and unfortunately (but naturally) block 2
and block 3 has broken data like this.
At the end of block2:
<username> R. fi
At the start of block3:
This means the original data is like this: (XML format data)
If I use the TextInputFormat (LineRecordReader and LineReader),
I thought that mapper 3 which handle block 2 will cover the start of block
3 to make those line to make the incompletely broken data meaningful!
And mapper 4 is reading the next element of end</username>. (Actually next
element is id: <id>55767</id>
If it is right, then some Mapper has great performance gain if it has some
block and its adjacent block for handling this kind of block spanning
(Because it can reduce the network I/O for get the next block to handle
At the block replacement result I shown, block 0 and block1 are existed in
same datanode (10.40.3.78).
Also, block1 and block2 are existed in same datanode (10.40.3.83).
However, block3 and block4 are not existed at least one same node.. (Both
two blocks are existed in different datanode)
At this point, I want to ask you guys about following questions:
1. The block replication policy consider this kind of situation?
2. Is there any wrong fact of my thought, especially one mapper handles
the end of its block and start of next block to make the line meaningful?
3. Why the SPLIT_SLOP has value 1.1 in FileInputSplit?
4. I know HDFS Block generation mechanism splits the input data
strictly based on the value of dfs.block.size, and that value is upper
value of InputSplit. But it is not correct because of SPLIT_SLOPE. But this
is wrong, I think! Please let me know the exact reason of InputSplitting
(Let we consider that the last remaining data is 64.8MB (bytesRemaining)
and splitSize is 64MB, so bytesRemaining / splitSize == 1.01 < SPLIT_SLOP,
so it just becomes one input splits!!)
Thank you for reading my very long question!
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.