Andy's points are reasonable, but there are a few omissions:
- modern file systems are pretty good at writing large files into
contiguous blocks if they have a reasonable amount of space available.
- the seeks in question are likely to have more to do with checking
directories for block locations than with seeking to the starts of
small-ish files, because modern file systems tend to group together files
that are written at about the same time.
- it is quite possible to build an HDFS-like file system that uses very
small blocks. There really are three considerations here that, when
conflated, make the design more difficult than necessary. These three are:

the primitive unit of disk allocation
This is the granularity of disk allocation. For HDFS, this is variable in size
since blocks can be smaller than the max size. The key problem with a
large size here is that it is relatively difficult to allow quick reading
of the file during writing. With a smaller block size, the block can be
committed in a way that the reader can read it much sooner. Extremely
large block sizes also make R/W file systems and snapshots more difficult
for basically the same reason. There is no strong reason that this has to
be conflated with the striping chunk size.
Putting HDFS on top of ext3 or ext4 kind of does this, but because HDFS
knows nothing about the blocks in the underlying system, you don't get the

the unit of node striping
This is the size of data that is sent to each node and is intended to
achieve read parallelism in map-reduce programs. This should be large
enough to cause a map task to take a reasonable time to process in order to
make task scheduling easier. A few hundred megabytes is commonly a good
size, but different applications may prefer sizes as small as a MB or as
large as a few GB.
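
As a rough illustration of the sizing argument above, the sketch below estimates how many map tasks a file yields and how long each full-size task runs for a given striping chunk size. The function name, the 1 TB file, and the 50 MB/s per-task scan rate are assumptions chosen for the example, not figures from this thread.

```python
MB = 1024 ** 2
GB = 1024 ** 3
TB = 1024 ** 4

def map_task_estimate(file_bytes, chunk_bytes, scan_bytes_per_sec):
    """Return (number of map tasks, seconds per full-size task).

    Assumes one map task per chunk and a constant scan rate per task.
    """
    tasks = -(-file_bytes // chunk_bytes)  # ceiling division on integers
    seconds_per_task = chunk_bytes / scan_bytes_per_sec
    return tasks, seconds_per_task

if __name__ == "__main__":
    # Hypothetical workload: a 1 TB file scanned at 50 MB/s per task.
    for chunk in (1 * MB, 256 * MB, 2 * GB):
        tasks, secs = map_task_estimate(1 * TB, chunk, 50 * MB)
        print(f"chunk={chunk // MB:>5} MB  tasks={tasks:>8}  ~{secs:.2f} s/task")
```

Under these assumptions, very small chunks produce tasks too short to be worth scheduling, while very large chunks produce few, long tasks.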

the unit of scaling
It is typical that something somewhere needs to remember what gets stuck
where in the cluster. Currently the NameNode does this with blocks.
Blocks are a bad choice here because they come and go quite often, which
means that the NameNode has to handle lots of changes, and because this
makes caching the NameNode data or persisting it to disk much harder.
Blocks also tend to limit scaling because you have to have so many of them
in a large system.
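
To put rough numbers on "so many of them", the sketch below counts how many units a central metadata service would have to track for a given amount of data. The 1 PB total and the unit sizes are assumptions for illustration; replication and partially filled units are ignored.

```python
MB = 1024 ** 2
GB = 1024 ** 3
PB = 1024 ** 5

def tracked_units(total_bytes, unit_bytes):
    """Number of units the metadata service must remember (ceiling division)."""
    return -(-total_bytes // unit_bytes)

if __name__ == "__main__":
    # Hypothetical cluster holding 1 PB of data.
    for label, unit in (("64 MB blocks", 64 * MB),
                        ("256 MB blocks", 256 * MB),
                        ("32 GB scaling units", 32 * GB)):
        print(f"{label:>20}: {tracked_units(PB, unit):>10} entries")
```

The count drops by orders of magnitude once the tracked unit is much larger than a block, which is the scaling argument made above.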
A counter-example to the design of HDFS is the MapR architecture. There,
the disk blocks are 8K, chunks are a few hundred megabytes (but flexible
within a single cluster), and the scaling unit is tens of gigabytes.
Separating these concepts allows disk contiguity, efficient node striping
and simple HA of the file system.
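
A minimal sketch of what keeping the three sizes separate looks like in terms of arithmetic. Only the 8 KB disk block comes from the paragraph above; the 256 MB chunk and 32 GB scaling unit are assumed stand-ins for "a few hundred megabytes" and "tens of gigabytes", and `locate` is a hypothetical helper, not a description of MapR internals.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3

DISK_BLOCK = 8 * KB      # disk allocation unit (from the text above)
CHUNK = 256 * MB         # striping unit (assumed value)
SCALING_UNIT = 32 * GB   # unit tracked centrally (assumed value)

def locate(offset):
    """Map a file byte offset to (chunk index, disk block index within that chunk)."""
    chunk_index, within_chunk = divmod(offset, CHUNK)
    return chunk_index, within_chunk // DISK_BLOCK

blocks_per_chunk = CHUNK // DISK_BLOCK        # disk blocks in one chunk
chunks_per_unit = SCALING_UNIT // CHUNK       # chunks that fit in one scaling unit

if __name__ == "__main__":
    print(locate(300 * MB), blocks_per_chunk, chunks_per_unit)
```

Each size can be tuned independently here: changing `CHUNK` affects task granularity without touching the disk allocation unit or the number of centrally tracked units.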
On Fri, Nov 16, 2012 at 11:53 AM, Andy Isaacson <[EMAIL PROTECTED]> wrote:
> On Fri, Nov 16, 2012 at 10:55 AM, Pankaj Gupta <[EMAIL PROTECTED]>
> > The Hadoop Definitive Guide provides a comparison with regular file systems
> > and indicates that the advantage is a lower number of seeks (as far as I
> > understood it; maybe I read it incorrectly, and if so I apologize). But, as I
> > understand, the data node stores data on a regular file system. If this is
> > so, then how does having a bigger HDFS block size provide better seek
> > performance, when the data will ultimately be read from a regular file
> > system, which has a much smaller block size?
> Suppose that HDFS stored data in smaller blocks (64 KB for example).
> Then ext4 would have no reason to put those small files close together
> on disk, and reading from an HDFS file would mean reading from very
> many ext4 files, and probably would mean many seeks.
> The large block size design of HDFS avoids that problem by giving ext4
> the information it needs to optimize for our desired use case.
> > I see other advantages of bigger block size though:
> > Fewer entries on NameNode to keep track of
> That's another benefit.
> > Less switching from datanode to datanode for the HDFS client when
> > the file. If block size were small, just this switching would reduce the