On Sun, Nov 18, 2012 at 11:25 AM, Majid Azimi <[EMAIL PROTECTED]> wrote:
> hi guys,
> I want to confirm that I have understood this topic correctly. HDFS
> block size is the number of bytes at which HDFS cuts a large file into
> chunks. Input split size is the number of bytes each mapper will actually
> process. It may be less or more than the HDFS block size. Am I right?
> Suppose we want to load a 110MB text file into HDFS, and both the HDFS
> block size and the input split size are set to 64MB.
> 1. The number of mappers is based on the number of input splits, not the
> number of HDFS blocks, right?
Correct. Although the default logic derives the split size from the
block size, there is no hard requirement that the two match.
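To make this concrete, here is a toy Python model of how the default FileInputFormat computes splits (a sketch of the logic, not Hadoop's actual Java code): the split size is derived from the block size, clamped by the configured min/max split sizes, and a short final split is only emitted once the remainder drops below a slack factor of the split size.

```python
SPLIT_SLOP = 1.1  # slack factor, same value FileInputFormat uses

def compute_splits(file_size, block_size, min_size=1, max_size=2**63 - 1):
    """Return (offset, length) pairs for each input split.

    Mirrors the default logic: splitSize = max(minSize, min(maxSize, blockSize)).
    """
    split_size = max(min_size, min(max_size, block_size))
    splits, offset, remaining = [], 0, file_size
    # Keep emitting full-size splits while the remainder is still
    # noticeably larger than one split.
    while remaining / split_size > SPLIT_SLOP:
        splits.append((offset, split_size))
        offset += split_size
        remaining -= split_size
    if remaining:
        splits.append((offset, remaining))
    return splits

# The 110MB example from the question: two splits, hence two map tasks.
print(compute_splits(110 * 1024 * 1024, 64 * 1024 * 1024))
# -> [(0, 67108864), (67108864, 48234496)]
```

Note that because 46MB / 64MB is below the 1.1 slack factor, the tail is not merged into the first split; it becomes its own (smaller) split, which is why the 110MB file yields exactly two mappers.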
> 2. When we set the HDFS block size to 64MB, is this exactly
> 67108864 (64*1024*1024) bytes? I mean, it doesn't matter that the file
> will be split in the middle of a line?
Yes. HDFS cuts files into blocks of exactly the configured size (only the
last block may be shorter), with no regard for the file's contents (just
as a regular filesystem doesn't). The reader is expected to handle record
boundaries itself (i.e. read until the last newline, etc.). See
http://wiki.apache.org/hadoop/HadoopMapReduce for how MR reads records
without breaking any of them, even when a block boundary falls in the
middle of a record.
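A tiny demonstration of that byte-exact cutting, using 16-byte "blocks" standing in for 64MB ones: the cut lands wherever the byte count says, including mid-line.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a byte string into fixed-size chunks, the way HDFS cuts files."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

text = b"first line\nsecond line\nthird line\n"
for block in split_into_blocks(text, 16):
    print(block)
# b'first line\nsecon'   <- "second line" is cut in half
# b'd line\nthird lin'
# b'e\n'                 <- only the last block is shorter than 16 bytes
```

HDFS stores these chunks as-is; stitching the broken record back together is entirely the record reader's job, as discussed below.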
> 3. Now we have 2 input splits (so two map tasks). The last line of the
> first block and the first line of the second block are not meaningful on
> their own. TextInputFormat is responsible for reading complete lines and
> handing them to the map tasks. What TextInputFormat does is:
> In the second block it seeks to the second line, which is a complete
> line, reads from there, and gives those lines to the second mapper.
> The first mapper reads until the end of the first block and also
> processes the (last incomplete line of the first block + first incomplete
> line of the second block).
Yes, this is explained at http://wiki.apache.org/hadoop/HadoopMapReduce as well.
> So the input split size of the first mapper is not exactly 64MB; it is a
> bit more than that (it includes the first incomplete line of the second
> block). Likewise, the input split size of the second mapper is a bit less
> than 64MB. Am I right?
> So the HDFS block size is an exact number, but the input split size
> depends on the data and may differ slightly from the configured number?
Yes, all correct.
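The boundary rule can be sketched in a few lines of Python (a simplified model of what LineRecordReader does, not Hadoop's actual code): every reader except the first skips its first, possibly partial, line because the previous reader owns it, and every reader keeps consuming until it finishes a line that started at or before its split's end.

```python
def read_split(data: bytes, start: int, end: int):
    """Return the lines belonging to split [start, end), LineRecordReader-style."""
    pos = start
    if start != 0:
        # Skip up to and including the first newline: that (partial) line
        # belongs to the previous split's reader.
        nl = data.find(b"\n", start)
        pos = len(data) if nl == -1 else nl + 1
    lines = []
    # Start a new record as long as its first byte is at or before `end`,
    # so the last record may run past the split boundary.
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        nl = len(data) if nl == -1 else nl
        lines.append(data[pos:nl])
        pos = nl + 1
    return lines

text = b"first line\nsecond line\nthird line\n"
# Two 16-byte splits covering the 34-byte file:
print(read_split(text, 0, 16))   # first mapper reads past its boundary
print(read_split(text, 16, 34))  # second mapper skips the partial first line
# -> [b'first line', b'second line']
# -> [b'third line']
```

The first reader processes slightly more than its 16 bytes (it finishes `second line`, which straddles the boundary) and the second reader slightly less, which is exactly the "a bit more / a bit less than 64MB" behavior described above, with every line delivered to exactly one mapper.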