I want to confirm that I have understood this topic correctly. The HDFS
block size is the number of bytes at which HDFS splits a large file into
smaller chunks (blocks). The input split size is the number of bytes each
mapper will actually process. It may be less or more than the HDFS block
size. Am I right?
Suppose we want to load a 110 MB text file into HDFS. The HDFS block size
and the input split size are both set to 64 MB.
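To make the numbers in this scenario concrete, here is a minimal sketch of
the arithmetic. It is modeled on how I understand FileInputFormat to size
splits (max(minSize, min(maxSize, blockSize)) with a 1.1 slack factor for
the last split); the function names are hypothetical, and the code is a
simulation, not Hadoop itself.

```python
MB = 1024 * 1024


def hdfs_blocks(file_size, block_size):
    """Byte length of each block when a file is cut at fixed offsets."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def compute_split_size(block_size, min_size=1, max_size=2**63 - 1):
    """Assumed split-size rule: max(minSize, min(maxSize, blockSize))."""
    return max(min_size, min(max_size, block_size))


def input_splits(file_size, split_size, slop=1.1):
    """Return (offset, length) pairs; a small tail is merged into the last split."""
    splits = []
    remaining = file_size
    while remaining / split_size > slop:
        splits.append((file_size - remaining, split_size))
        remaining -= split_size
    if remaining:
        splits.append((file_size - remaining, remaining))
    return splits


file_size = 110 * MB
block_size = 64 * MB

blocks = hdfs_blocks(file_size, block_size)
splits = input_splits(file_size, compute_split_size(block_size))

print(blocks)  # [67108864, 48234496]
print(splits)  # [(0, 67108864), (67108864, 48234496)]
```

Under these assumptions the file occupies two blocks (64 MB and 46 MB) and
yields two logical splits with the same byte ranges, so two map tasks.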
1. The number of mappers is based on the number of input splits, not the
number of HDFS blocks?
2. When we set the HDFS block size to 64 MB, is this exactly 67108864
(64*1024*1024) bytes? I mean, does it not matter that the file may be split
in the middle of a line?
3. Now we have 2 input splits (so two map tasks). The last line of the
first block and the first line of the second block are not meaningful on
their own. TextInputFormat is responsible for reading complete lines and
giving them to the map tasks. What TextInputFormat does is:
- In the second block, it seeks to the second line, which is a complete
line, reads from there, and gives the data to the second mapper.
- The first mapper reads until the end of the first block and also
processes the line that straddles the boundary (last incomplete line of the
first block + first incomplete line of the second block).
So the amount of data the first mapper processes is not exactly 64 MB. It
is a bit more than that (by the first incomplete line of the second block).
Likewise, the second mapper processes a bit less than its nominal split
size. Am I right?
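The boundary handling described in point 3 can be sketched as a small
simulation. This is a simplified model of a line-oriented reader, assuming
that a reader whose split does not start at offset 0 discards everything up
to the first newline, and that a reader keeps going while a line starts at
or before the end of its split. The function read_split and the toy data
are hypothetical, not Hadoop's API.

```python
def read_split(data: bytes, start: int, length: int):
    """Return the complete lines a reader for [start, start+length) would emit."""
    end = start + length
    pos = start
    if start != 0:
        # Not the first split: skip the (possibly partial) first line,
        # because the reader of the previous split is assumed to own it.
        nl = data.find(b"\n", pos)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Read every line that starts at or before the split end, even if
    # the line itself runs past the end into the next block.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl + 1
        lines.append(data[pos:stop])
        pos = stop
    return lines


data = b"alpha\nbravo\ncharlie\n"  # 20 bytes, three lines
split_size = 10                     # boundary falls inside "bravo"

first = read_split(data, 0, split_size)
second = read_split(data, split_size, split_size)

print(first)   # [b'alpha\n', b'bravo\n']  -> 12 bytes, more than 10
print(second)  # [b'charlie\n']            -> 8 bytes, less than 10
```

In this toy case every line is emitted exactly once, the first reader
consumes a little more than its nominal split, and the second a little
less, which is the behavior the question describes.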
So the HDFS block size is an exact number of bytes, but the amount of data
in an input split depends on record boundaries in our data, and so may
differ slightly from the configured number?