Re: HDFS block size
Andy Isaacson 2012-11-16, 19:53
On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <[EMAIL PROTECTED]> wrote:
> The Hadoop Definitive Guide compares HDFS with regular file systems
> and indicates that the advantage is a lower number of seeks (as far as I
> understood it; maybe I read it incorrectly, if so I apologize). But, as I
> understand, the data node stores data on a regular file system. If this is
> so, then how does having a bigger HDFS block size provide better seek
> performance, when the data will ultimately be read from a regular file system
> which has a much smaller block size?
Each HDFS block is stored as one ordinary file on the data node.
Suppose that HDFS stored data in smaller blocks (64 KB, for example).
Then ext4 would have no reason to put those small files close together
on disk, and reading from an HDFS file would mean reading from very
many ext4 files, which would probably mean many seeks.
The large block size design of HDFS avoids that problem by giving ext4
the information it needs to optimize for our desired use case.
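As a rough illustration of the seek argument, here is a minimal sketch of a
pessimistic cost model: one seek per block file, followed by a sequential
transfer. The function names, the 10 ms seek time, and the 100 MB/s transfer
rate are assumptions chosen for illustration, not measurements of any real
disk or of ext4's actual layout behavior.

```python
def blocks_needed(file_bytes, block_bytes):
    """Number of HDFS blocks (one datanode-local file each) holding the file."""
    return (file_bytes + block_bytes - 1) // block_bytes  # integer ceiling


def read_time_seconds(file_bytes, block_bytes, seek_s=0.010, bytes_per_s=100e6):
    """Pessimistic model: one seek per block, then a sequential transfer.

    seek_s and bytes_per_s are assumed round numbers, not measurements.
    """
    seeks = blocks_needed(file_bytes, block_bytes)
    return seeks * seek_s + file_bytes / bytes_per_s


ONE_GIB = 1 << 30
small = read_time_seconds(ONE_GIB, 64 * 1024)         # 64 KB blocks
large = read_time_seconds(ONE_GIB, 64 * 1024 * 1024)  # 64 MB blocks
print(f"64 KB blocks: {small:.1f} s, 64 MB blocks: {large:.1f} s")
```

Under these assumptions a 1 GiB file needs 16384 seeks with 64 KB blocks
but only 16 with 64 MB blocks, so the seek term dominates the small-block
case while the transfer term is identical in both.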
> I see other advantages of bigger block size though:
> Fewer entries on NameNode to keep track of
That's another benefit.
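To put a rough number on that, the sketch below counts how many blocks the
NameNode would have to track for a data set at two block sizes. The helper
name and the data set (1000 files of 1 GiB each) are hypothetical, and the
count ignores replication and per-file metadata.

```python
def total_block_entries(file_sizes, block_bytes):
    """Count the blocks the NameNode would have to track for these files."""
    # Integer ceiling per file: the last block of a file may be partial.
    return sum((size + block_bytes - 1) // block_bytes for size in file_sizes)


# Hypothetical data set: 1000 files of 1 GiB each.
files = [1 << 30] * 1000
entries_small = total_block_entries(files, 64 * 1024)         # 64 KB blocks
entries_large = total_block_entries(files, 64 * 1024 * 1024)  # 64 MB blocks
print(entries_small, entries_large)
```

The ratio between the two counts is simply the ratio of the block sizes,
a factor of 1024 here.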
> Less switching from datanode to datanode for the HDFS client when fetching
> the file. If block size were small, just this switching would reduce the
> performance a lot. Perhaps this is the seek that the definitive guide refers
If you were building HDFS with a smaller block size, you'd probably
have to overlap block fetches from many data nodes in order to get
decent performance. So yes, this "switching" as you term it would be a
> Less overhead cost of setting up Map tasks. The way MR usually works is that
> one Map task is created per block. A smaller block will mean less computation
> per map task and thus overhead of setting up the map task would become
An MR designed for a small-block HDFS would probably have to do
something other than one mapper per block.
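To see why one mapper per block stops making sense at small block sizes,
here is a minimal sketch of the fraction of a map task's wall time spent on
setup. The 1 second setup cost and the 100 MB/s processing rate are assumed
round numbers, and the function name is hypothetical.

```python
def setup_overhead_fraction(block_bytes, setup_s=1.0, bytes_per_s=100e6):
    """Fraction of a map task's time spent on setup rather than on data.

    setup_s and bytes_per_s are illustrative assumptions, not measurements.
    """
    work_s = block_bytes / bytes_per_s
    return setup_s / (setup_s + work_s)


overhead_small = setup_overhead_fraction(64 * 1024)         # 64 KB blocks
overhead_large = setup_overhead_fraction(64 * 1024 * 1024)  # 64 MB blocks
print(f"64 KB: {overhead_small:.1%} setup, 64 MB: {overhead_large:.1%} setup")
```

With these numbers a task over a 64 KB block spends nearly all of its time
on setup, which is why a small-block design would need to hand each mapper
many blocks.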
> I want to make sure I understand the advantages of having a larger block
> size. I specifically want to know whether there is any advantage in terms of
> disk seeks; that one thing has got me very confused.
Seems like you have a pretty good understanding of the issues, and I
hope I clarified the seek issue above.