

Re: HDFS Block size vs Input Split Size
Hi,

On Sun, Nov 18, 2012 at 11:25 AM, Majid Azimi <[EMAIL PROTECTED]> wrote:
> hi guys,
>
> I want to confirm that I have understood this topic correctly. The HDFS
> block size is the number of bytes at which HDFS splits a large file into
> blocks. The input split size is the number of bytes each mapper will actually
> process. It may be less or more than the HDFS block size. Am I right?

Yes.

> Suppose we want to load a 110MB text file into HDFS, and the HDFS block size
> and input split size are both set to 64MB.
>
> 1. The number of mappers is based on the number of input splits, not the
> number of HDFS blocks, right?

Correct. Although the default logic derives the split size from the
block size, there is no hard requirement that they match.
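For reference, the default calculation clamps the block size between the configured minimum and maximum split sizes (this mirrors FileInputFormat's computeSplitSize; the demo class around it is just an illustrative sketch):

```java
// Sketch of the default split-size calculation (mirrors the logic in
// FileInputFormat.computeSplitSize; the surrounding demo class is illustrative).
public class SplitSizeDemo {
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        // Clamp the block size between the configured min and max split sizes.
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // 64MB HDFS block
        // Defaults (min = 1, max = Long.MAX_VALUE): split size == block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE));                 // 67108864
        // Raising the minimum split size above the block size forces larger splits.
        System.out.println(computeSplitSize(blockSize, 128L * 1024 * 1024, Long.MAX_VALUE)); // 134217728
    }
}
```

So with the default settings each split covers one block, which is why the split size usually "follows" the block size even though nothing forces it to.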

> 2. When we set the HDFS block size to 64MB, is this exactly 67108864
> (64*1024*1024) bytes? I mean, it doesn't matter if the file gets split in the
> middle of a line?

Yes, HDFS makes blocks of exact, arbitrary byte sizes. HDFS does not
concern itself with a file's contents (just as a generic filesystem
doesn't). The reader is expected to handle record boundaries properly
(i.e., read until the last newline, etc.). See
http://wiki.apache.org/hadoop/HadoopMapReduce for how MR reads the
records without a break in between, even if a block boundary has
broken a record.
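To make the 110MB example above concrete (assuming "110MB" means exactly 110 * 1024 * 1024 bytes), HDFS cuts at exact byte offsets and ignores line boundaries entirely, so the file becomes one full 64MB block plus a smaller tail block:

```java
// Block layout for a 110MB file with a 64MB block size: HDFS cuts at
// exact byte offsets, regardless of where lines begin or end.
public class BlockLayoutDemo {
    public static void main(String[] args) {
        long fileSize  = 110L * 1024 * 1024;  // 115343360 bytes
        long blockSize = 64L  * 1024 * 1024;  // 67108864 bytes
        long tail      = fileSize % blockSize;             // 48234496 bytes (exactly 46MB)
        long blocks    = fileSize / blockSize + (tail > 0 ? 1 : 0);
        System.out.println(blocks + " blocks");            // 2 blocks
        System.out.println("block 1: " + blockSize + " bytes");
        System.out.println("block 2: " + tail + " bytes");
    }
}
```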

> 3. Now we have 2 input splits (so two maps). The last line of the first block
> and the first line of the second block are not meaningful on their own.
> TextInputFormat is responsible for reading meaningful lines and giving them
> to the map tasks. What TextInputFormat does is:
>
> In the second block it will seek to the second line, which is a complete
> line, read from there, and give those lines to the second mapper.
> The first mapper will read until the end of the first block and will also
> process the (last incomplete line of the first block + first incomplete line
> of the second block).

Yes, this is explained at http://wiki.apache.org/hadoop/HadoopMapReduce as well.
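A toy model of that boundary rule (a sketch of what LineRecordReader does in spirit, not the actual Hadoop code): every split except the first skips its first, possibly partial, line, and every split reads past its end offset to finish the last line it started:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model of TextInputFormat/LineRecordReader boundary handling
// (illustrative sketch, not the real Hadoop implementation).
public class SplitReaderDemo {
    static List<String> readSplit(byte[] data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = start;
        if (start != 0) {
            // Not the first split: skip the (possibly broken) first line;
            // the previous split is responsible for reading it.
            while (pos < data.length && data[pos] != '\n') pos++;
            pos++;
        }
        // Keep reading lines that start at or before 'end'; the last line
        // is read to completion even if it crosses the split boundary.
        while (pos <= end && pos < data.length) {
            int lineStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, lineStart, pos - lineStart, StandardCharsets.UTF_8));
            pos++;
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aaa\nbbb\nccc\nddd".getBytes(StandardCharsets.UTF_8);
        // Split the 15-byte "file" at byte 5, in the middle of "bbb".
        System.out.println(readSplit(data, 0, 5));   // [aaa, bbb] - finishes the broken line
        System.out.println(readSplit(data, 5, 15));  // [ccc, ddd] - skips the broken line
    }
}
```

Note how the two rules are complementary: the line broken by the boundary is read exactly once, by the split in which it starts.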

> So the input split of the first mapper is not exactly 64MB; it is a bit
> more than that (it includes the first incomplete line of the second block).
> Likewise, the input split of the second mapper is a bit less than 64MB.
> Am I right? So the HDFS block size is an exact number, but the input split
> size depends on our data, which may make it a little different from the
> configured number, right?

Yes, all correct.

--
Harsh J