RE: Why big block size for HDFS.
John Lilley 2013-03-31, 18:58
From: Rahul Bhattacharjee [mailto:[EMAIL PROTECTED]]
Subject: Why big block size for HDFS.
>It has been written in many places that, to avoid a huge number of disk seeks, we store big blocks in HDFS, so that once we seek to the location, only the data transfer rate is predominant, with no more seeks. I am not sure if I have understood this correctly.
>My question is: no matter what block size we decide on, the data is finally written to the computer's HDD, which is formatted with a block size in KBs. Also, while writing to the FS (not HDFS), it's not guaranteed that the blocks we write are contiguous, so there would be disk seeks anyway. The assumption of HDFS would only be true if the underlying FS guarantees to write the data in contiguous blocks.
>Can someone explain a bit?
While there are no guarantees that disk storage will be contiguous, the OS will attempt to keep large files contiguous (and may even defrag over time), and if all files are written using large blocks, this is more likely to be the case. If storage is contiguous, you can write a complete track without seeking. Track size varies, but a 1TB disk might have 500KB/track. Stepping between adjacent tracks is also much cheaper than the average seek, and as you might expect, has been optimized in hardware to assist sequential I/O. However, if you switch storage units, you will probably incur at least one full seek at the start of the block (since the disk head was probably somewhere else at the time). The result is that, on average, writing sequential files is very fast (>100MB/sec on typical SATA). But I think the per-block overhead has more to do with finding where to read the next block from, assuming that the data has been distributed evenly.
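To put rough numbers on the seek argument above, here is a back-of-the-envelope sketch. It assumes exactly one full seek per block and a 10 ms seek time (my assumption, not a measured figure), and uses the ~100MB/sec sequential rate mentioned above. The function name is made up for illustration.

```python
MB = 1024 * 1024

def seek_overhead_fraction(block_bytes, seek_seconds=0.010,
                           transfer_bytes_per_sec=100e6):
    """Fraction of per-block time spent seeking.

    Assumes one full seek per block (seek_seconds is an assumed value)
    followed by a purely sequential transfer of the whole block.
    """
    transfer_seconds = block_bytes / transfer_bytes_per_sec
    return seek_seconds / (seek_seconds + transfer_seconds)

# Compare a filesystem-sized block with HDFS-sized blocks.
for size in (4 * 1024, 1 * MB, 64 * MB, 128 * MB):
    frac = seek_overhead_fraction(size)
    print(f"{size / MB:10.4f} MB block: {frac:.2%} of time seeking")
```

Under these assumptions a 4KB block spends almost all of its time seeking, while a 64MB block spends under 2% of it seeking. Real disks will differ, but the trend is the point.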
So consider connection overhead when the data is distributed. I am no expert on the Hadoop internals, but I suspect that somewhere, a TCP connection is opened to transfer data. Whether connection overhead is reduced by maintaining open connection pools, I don’t know. But let’s assume that there is *some* overhead for switching data transfer from machine “A” that owns block “1000” to machine “B” that owns block “1001”. The larger the block size, the less significant this overhead is relative to the sequential transfer rate.
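The same reasoning can be sketched for a whole file read block by block from different machines. The 5 ms switch cost below is purely a placeholder for that unknown overhead, not a Hadoop measurement, and the function name is hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB

def read_time_seconds(file_bytes, block_bytes, switch_seconds=0.005,
                      transfer_bytes_per_sec=100e6):
    """Estimated time to read a file stored as fixed-size blocks.

    Assumes a fixed switch cost per block (switch_seconds is an assumed
    placeholder) plus sequential transfer of the file's bytes.
    """
    blocks = (file_bytes + block_bytes - 1) // block_bytes  # ceiling division
    return blocks * switch_seconds + file_bytes / transfer_bytes_per_sec

# Same 10GB file, different block sizes.
for block_mb in (1, 64, 128):
    t = read_time_seconds(10 * GB, block_mb * MB)
    print(f"{block_mb:4d} MB blocks: {t:8.2f} s")
```

With 1MB blocks the assumed switch cost adds about 51 seconds to roughly 107 seconds of transfer; with 128MB blocks it adds well under a second.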
In addition, MapReduce/YARN has an easier time scheduling if there are fewer blocks.
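For a sense of scale on the scheduling point, this sketch counts how many blocks a file breaks into. It assumes roughly one task per block, which I believe is the usual default for file inputs but which is configurable, so treat the counts as illustrative only.

```python
MB = 1024 * 1024
TB = 1024 * 1024 * MB

def block_count(file_bytes, block_bytes):
    """Number of fixed-size blocks needed to hold file_bytes (ceiling division)."""
    return (file_bytes + block_bytes - 1) // block_bytes

# Assumption: about one task to schedule per block.
for block_mb in (1, 64, 128):
    n = block_count(1 * TB, block_mb * MB)
    print(f"{block_mb:4d} MB blocks: {n:8d} blocks (and roughly as many tasks)")
```

For 1TB of input, 1MB blocks mean over a million units to schedule, while 128MB blocks mean 8192.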