Re: About block splitting, input split and TextInputFormat in MapReduce
Yoonmin,

Please give http://wiki.apache.org/hadoop/HadoopMapReduce a read; it
covers the read-records-across-blocks scenario.
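
For reference, a minimal sketch of that scenario (illustrative only; the class
and field names below are invented, and Hadoop's actual LineRecordReader is
more involved, handling compression and custom delimiters as well): a reader
skips its first partial line unless its split starts at byte 0, and reads one
line past the end of its split, so each line is consumed by exactly one mapper
even when it spans a block boundary.

// Illustrative sketch only; not Hadoop's actual LineRecordReader.
class LineSplitReader {
    private final java.io.RandomAccessFile in;
    private final long end;  // first byte past this split
    private long pos;

    LineSplitReader(java.io.File file, long start, long length)
            throws java.io.IOException {
        this.end = start + length;
        this.in = new java.io.RandomAccessFile(file, "r");
        in.seek(start);
        // Unless this split begins the file, the line we landed in the
        // middle of belongs to the previous split's reader: skip it.
        if (start != 0) {
            in.readLine();
        }
        this.pos = in.getFilePointer();
    }

    // Returns the next complete line, or null when the split is done.
    // The position test happens BEFORE reading, so a line that starts
    // inside this split is read to completion even if it ends in the
    // next block.
    String nextLine() throws java.io.IOException {
        if (pos > end) {
            return null;
        }
        String line = in.readLine();
        pos = in.getFilePointer();
        return line;
    }
}

Under this rule, the mapper for block 2 in the message below reads past the
block boundary to finish <username>R. fiend</username>, and the mapper for
block 3 skips that partial tail and starts at the following element.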

On Thu, Oct 17, 2013 at 8:47 PM, Yoonmin Nam <[EMAIL PROTECTED]> wrote:
> Hi.
>
> Let us consider this situation:
>
> 1.     Block size = 67108864 (64MB)
>
> 2.     Data size = 2.2GB… (larger than block size)
>
>
>
> Then, when I put the input into HDFS, I got the following block placement
> result:
>
>
>
>
>
> Then, I checked each HDFS block, and unfortunately (but naturally) block 2
> and block 3 contain broken data like this.
>
>
>
> At the end of block 2:
>
> …
>
> …
>
> <username> R. fi
>
>
>
> At the start of block 3:
>
> end</username>
>
>
>
> This means the original data looks like this (XML-format data):
>
> <username>R. fiend</username>
>
>
>
> If I use TextInputFormat (LineRecordReader and LineReader),
>
> I thought that mapper 3, which handles block 2, would also read the start of
> block 3 to complete the broken line and make the incomplete data meaningful!
>
>
>
> And mapper 4 starts reading at the element after end</username>. (Actually,
> the next element is the id: <id>55767</id>.)
>
>
>
> If that is right, then a mapper gets a significant performance gain when it
> holds both a block and its adjacent block locally while handling this kind of
> block-spanning problem.
>
> (Because it can avoid the network I/O needed to fetch the next block when
> completing the broken element.)
>
>
>
> In the block placement result I showed, block 0 and block 1 are on the same
> datanode (10.40.3.78).
>
> Also, block 1 and block 2 are on the same datanode (10.40.3.83).
>
>
>
> However, block 3 and block 4 do not share even one node (the two blocks are
> on entirely different datanodes).
>
>
>
> At this point, I want to ask you the following questions:
>
>
>
> 1.     Does the block placement policy consider this kind of situation?
>
> 2.     Is anything wrong in my reasoning, especially the idea that one mapper
> handles the end of its own block plus the start of the next block to make the
> line meaningful?
>
> 3.     Why does SPLIT_SLOP have the value 1.1 in FileInputFormat?
>
> 4.     I know the HDFS block generation mechanism splits the input data
> strictly on the value of dfs.block.size, and I assumed that value is the
> upper bound on an InputSplit's size. But that is not quite right because of
> SPLIT_SLOP, I think! Please let me know the exact reasoning behind the
> input-splitting mechanism! (A sketch of the split loop appears after this
> message.)
>
> (Consider that the last remaining data is 64.8MB (bytesRemaining) and
> splitSize is 64MB, so bytesRemaining / splitSize ≈ 1.01 < SPLIT_SLOP, and it
> just becomes one input split!)
>
>
>
> Thank you for reading my very long question!

--
Harsh J
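
On question 4: the slop logic lives in FileInputFormat.getSplits. Below is a
simplified paraphrase of that loop; SPLIT_SLOP and bytesRemaining follow
Hadoop's names, but treat this as a sketch rather than the exact code.

// Simplified paraphrase of the split loop in FileInputFormat.getSplits;
// a sketch, not the real implementation.
class SplitSlopSketch {
    // Same value as Hadoop's FileInputFormat.SPLIT_SLOP.
    private static final double SPLIT_SLOP = 1.1;

    // Returns (offset, length) pairs describing the splits of one file.
    static java.util.List<long[]> computeSplits(long fileLength,
                                                long splitSize) {
        java.util.List<long[]> splits = new java.util.ArrayList<>();
        long bytesRemaining = fileLength;
        // Cut a full-size split only while MORE than 1.1 split-sizes
        // remain, so a small tail (up to 10% of a split) is folded into
        // the last split instead of becoming a tiny split of its own.
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] { fileLength - bytesRemaining, splitSize });
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits.add(new long[] { fileLength - bytesRemaining,
                                    bytesRemaining });
        }
        return splits;
    }
}

With splitSize = 64MB and 64.8MB remaining, 64.8 / 64 ≈ 1.01 < 1.1, so the
loop exits and the tail becomes a single 64.8MB split, exactly as in the
example above. An InputSplit is a logical byte range, so it does not have to
coincide with an HDFS block boundary; dfs.block.size sets the default split
size, not a hard upper bound.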