(Re)About block splitting, input split and TextInputFormat in MapReduce

Let us consider this situation:

1.     Block size = 67108864 (64MB)

2.     Data size = 2.2GB. (larger than block size)
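As a rough check on how many blocks to expect, here is a minimal Python sketch. It assumes that "2.2GB" means 2.2 * 1024^3 bytes (my data size is only approximate), and the helper name block_count is just for illustration, not anything from Hadoop.

```python
BLOCK_SIZE = 67108864  # 64MB, the block size given above


def block_count(data_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed: all full blocks plus one partial block, if any."""
    return -(-data_size // block_size)  # ceiling division on integers


# Assumption: "2.2GB" is taken as 2.2 * 1024**3 bytes.
data_size = int(2.2 * 1024**3)
print(block_count(data_size))  # 36 (35 full blocks and one partial block)
```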


Then, when I put the input into HDFS, I got the following block
replication result:






Then I checked each HDFS block and, unfortunately (but naturally), block 2
and block 3 contain broken data like this:


At the end of block2:



<username> R. fi


At the start of block3:



This means the original data looks like this (XML format):

<username>R. fiend</username>


If I use TextInputFormat (LineRecordReader and LineReader), I think that
mapper 3, which handles block 2, will read past the end of its block into the
start of block 3 so that the broken line becomes complete and meaningful.


And mapper 4 starts reading at the element after end</username>. (Actually, the
next element is id: <id>55767</id>.)
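To make my understanding concrete, here is a minimal Python sketch of the line-reading rule as I believe LineRecordReader applies it: a reader whose split does not start at offset 0 throws away everything up to the first newline, and every reader keeps reading whole lines as long as the line starts at or before the end of its split. The function name read_split and the tiny sample data are my own assumptions for illustration, not Hadoop code.

```python
def read_split(data: bytes, start: int, length: int) -> list:
    """Return the lines one mapper would see for the split [start, start + length)."""
    end = start + length
    pos = start
    if start != 0:
        # The partial (or whole) first line belongs to the previous split: skip it.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A line is ours if it starts at or before `end`, even if it finishes
    # in the next split (this is the read past the boundary).
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            records.append(data[pos:])
            break
        records.append(data[pos:newline])
        pos = newline + 1
    return records


# Hypothetical sample: the boundary falls between "fi" and "end</username>".
data = b"<id>1</id>\n<username>R. fiend</username>\n<id>55767</id>\n"
boundary = data.index(b"end</username>")

first = read_split(data, 0, boundary)
second = read_split(data, boundary, len(data) - boundary)
print(first)   # [b'<id>1</id>', b'<username>R. fiend</username>']
print(second)  # [b'<id>55767</id>']
```

If this rule is right, every line is read exactly once: the mapper for the earlier split completes the broken line, and the mapper for the later split starts at the following element.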


If that is right, then a mapper gets a large performance gain when its own
block and the adjacent block are on the same node, because handling a line
that spans two blocks

(it reduces the network I/O needed to fetch the next block in order to
complete the broken element)


In the block placement result I showed, block 0 and block 1 exist on the
same datanode (

Also, block 1 and block 2 exist on the same datanode (


However, block 3 and block 4 do not share even one node. (The two blocks
exist only on different datanodes.)


At this point, I want to ask you the following questions:


1.     Does the block replication policy consider this kind of situation?

2.     Is anything wrong in my understanding, especially that one mapper handles
the end of its block and the start of the next block to make the line meaningful?

3.     Why does SPLIT_SLOP have the value 1.1 in FileInputFormat?

4.     I know that HDFS splits the input data into blocks strictly by the value
of dfs.block.size, and I assumed that value is also the upper bound on the size
of an InputSplit. But because of SPLIT_SLOP that is not the case, which seems
wrong to me! Please let me know the exact reasoning behind the input splitting mechanism!

(Let us consider that the last remaining data is 64.8MB (bytesRemaining) and
splitSize is 64MB, so bytesRemaining / splitSize is about 1.01 < SPLIT_SLOP, so it
just becomes one input split!!)
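The case above can be sketched in Python. This is only my reading of the split loop in FileInputFormat, not the actual Java code: a split of splitSize is cut off only while bytesRemaining / splitSize is greater than SPLIT_SLOP, and whatever is left becomes the last split. The function name compute_splits is my own.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size
MB = 1024 * 1024


def compute_splits(length: int, split_size: int) -> list:
    """Return (offset, size) pairs for a file of `length` bytes."""
    splits = []
    remaining = length
    # Cut full-size splits only while the remainder is clearly more than one split.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((length - remaining, split_size))
        remaining -= split_size
    # The tail (up to 1.1 * split_size) stays together as one split.
    if remaining != 0:
        splits.append((length - remaining, remaining))
    return splits


# 64.8MB remaining with a 64MB split size: ratio is about 1.01, so one split.
print(len(compute_splits(648 * MB // 10, 64 * MB)))  # 1
# 71MB: ratio is about 1.109 > 1.1, so a 64MB split plus a 7MB split.
print(compute_splits(71 * MB, 64 * MB) == [(0, 64 * MB), (64 * MB, 7 * MB)])  # True
```

So in the 64.8MB case the single split covers one full block plus 0.8MB of the next block, which is exactly the situation I am asking about.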


Thank you for reading my very long question!