MapReduce, mail # dev - About block splitting, input split and TextInputFormat in MapReduce


Re: About block splitting, input split and TextInputFormat in MapReduce
Harsh J 2013-10-19, 06:02
Yoonmin,

Please give http://wiki.apache.org/hadoop/HadoopMapReduce a read; it
covers the scenario of reading records that span blocks.
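
As a rough illustration of that scenario, here is a minimal Python sketch of how a line-oriented reader can treat split boundaries. It is a simplified model, not Hadoop's Java code: read_split and make_splits are hypothetical names, the whole input is held in memory, and only "\n" line endings are handled. The idea is that every split except the first discards its first (possibly partial) line, and every split keeps reading past its own end until the line it is on is finished.

```python
def make_splits(total_len, split_size):
    """Cut [0, total_len) into fixed-size (start, length) ranges (no slop)."""
    return [(s, min(split_size, total_len - s))
            for s in range(0, total_len, split_size)]


def read_split(data, start, length):
    """Return the lines that the reader of split (start, length) would emit."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: the previous reader owns the line that
        # crosses (or begins exactly at) our start, so skip past it.
        newline = data.find(b"\n", pos)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # A line that begins at or before `end` belongs to this split,
    # even if finishing it means reading bytes of the next split.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            records.append(data[pos:])
            pos = len(data)
        else:
            records.append(data[pos:newline])
            pos = newline + 1
    return records


# The broken record from the question: the boundary falls between
# "fi" and "end</username>".
data = b"<username>R. fiend</username>\n<id>55767</id>\n"
cut = data.index(b"end</username>")
first = read_split(data, 0, cut)
second = read_split(data, cut, len(data) - cut)
print(first)
print(second)
```

In this model the first reader emits the whole username line and the second one starts at the id element, so no record is lost or duplicated. On a real cluster, the bytes read past the end of a split may have to come from a block on another datanode.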

On Thu, Oct 17, 2013 at 8:47 PM, Yoonmin Nam <[EMAIL PROTECTED]> wrote:
> Hi.
>
> Let us consider this situation:
>
> 1. Block size = 67108864 bytes (64MB)
>
> 2. Data size = 2.2GB… (larger than the block size)
>
>
>
> Then, when I put the input into HDFS, I got the following block
> replication result:
>
>
>
>
>
> Then I checked each HDFS block and, unfortunately (but naturally), block 2
> and block 3 contain a record that is broken across the block boundary, like this:
>
>
>
> At the end of block2:
>
> …
>
> …
>
> <username> R. fi
>
>
>
> At the start of block3:
>
> end</username>
>
>
>
> This means the original data (in XML format) looks like this:
>
> <username>R. fiend</username>
>
>
>
> If I use TextInputFormat (LineRecordReader and LineReader),
>
> I thought that mapper 3, which handles block 2, would also read the start of
> block 3 to complete the broken line and make it meaningful.
>
>
>
> And mapper 4 starts reading from the element after end</username>. (The next
> element is actually the id: <id>55767</id>.)
>
>
>
> If that is right, then a mapper gets a large performance gain when its own
> block and the adjacent block are on the same node, because it can handle
> this kind of block-spanning record locally.
>
> (It avoids the network I/O needed to fetch the next block to complete the
> broken element.)
>
>
>
> In the block placement result I showed, block 0 and block 1 exist on the
> same datanode (10.40.3.78).
>
> Also, block 1 and block 2 exist on the same datanode (10.40.3.83).
>
>
>
> However, block 3 and block 4 do not share any datanode. (The two
> blocks are stored on different datanodes.)
>
>
>
> At this point, I want to ask you guys the following questions:
>
>
>
> 1. Does the block replication policy consider this kind of situation?
>
> 2. Is anything wrong in my understanding, especially that one mapper handles
> the end of its block and the start of the next block to make the line meaningful?
>
> 3. Why does SPLIT_SLOP have the value 1.1 in FileInputFormat?
>
> 4. I know the HDFS block generation mechanism splits the input data strictly
> based on the value of dfs.block.size, and I thought that value was the upper
> bound on the size of an InputSplit. But that does not hold because of
> SPLIT_SLOP, which seems wrong to me. Please let me know the exact reasoning
> behind the input splitting mechanism!
>
> (Consider that the last remaining data is 64.8MB (bytesRemaining) and
> splitSize is 64MB, so bytesRemaining / splitSize == 1.01 < SPLIT_SLOP, so it
> just becomes one input split!)
>
>
>
> Thank you for reading my very long question!

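On questions 3 and 4, the arithmetic in the quoted example can be checked with a small Python model of the split loop in FileInputFormat.getSplits. This is a sketch under assumptions, not the Java source: compute_splits is a hypothetical name, and the model ignores minimum/maximum split-size settings and unsplittable files. As far as I know, the slop exists so that a small tail is folded into the last split instead of becoming a tiny split of its own, which means the last split may be up to 10% larger than splitSize.

```python
SPLIT_SLOP = 1.1  # same value as the constant discussed in the question


def compute_splits(file_length, split_size):
    """Return (offset, length) pairs for one file, in bytes."""
    splits = []
    remaining = file_length
    # Carve off full-size splits while more than 1.1 split sizes remain.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((file_length - remaining, split_size))
        remaining -= split_size
    # Whatever is left (at most 1.1 * split_size) becomes the last split.
    if remaining != 0:
        splits.append((file_length - remaining, remaining))
    return splits


MB = 1024 * 1024
split_size = 64 * MB              # 67108864, the block size from the question
tail = 64 * MB + int(0.8 * MB)    # roughly the 64.8MB case from the question

print(compute_splits(tail, split_size))
print(compute_splits(71 * MB, split_size))
```

With 64.8MB remaining the ratio is about 1.01, so the model yields a single split that is slightly larger than one block. With 71MB the ratio is about 1.109, just above the slop, so it yields a 64MB split followed by a 7MB split.
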
--
Harsh J