HDFS dev mailing list: Number of bytes per checksum


Praveen Sripati 2011-06-24, 14:24
Doug Cutting 2011-06-24, 14:50
Re: Number of bytes per checksum
Doing CRC32 on a huge data block also reduces its error detection
capability: the Hamming distance a CRC can guarantee falls as the
protected data grows, so some multi-bit error patterns that are always
caught within a 512-byte chunk can go undetected across a 64 MB block.

If you need more information on this topic, this paper is a good
starting point:
 
http://www.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf

Kihwal
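
A minimal illustrative sketch of the idea (not HDFS's actual
DataChecksum code; the 64 KB buffer below stands in for a real block):
checksumming in io.bytes.per.checksum-sized chunks keeps each CRC32
within lengths where its detection guarantees are strong, and pins any
corruption down to a single 512-byte chunk.

    import java.util.zip.CRC32;

    public class ChunkedCrcSketch {
        static final int BYTES_PER_CHECKSUM = 512; // io.bytes.per.checksum default

        // One CRC32 per 512-byte chunk, in the style of HDFS's small
        // checksum intervals, rather than one CRC over the whole block.
        static long[] checksum(byte[] block) {
            int chunks = (block.length + BYTES_PER_CHECKSUM - 1) / BYTES_PER_CHECKSUM;
            long[] sums = new long[chunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < chunks; i++) {
                int off = i * BYTES_PER_CHECKSUM;
                int len = Math.min(BYTES_PER_CHECKSUM, block.length - off);
                crc.reset();
                crc.update(block, off, len);
                sums[i] = crc.getValue();
            }
            return sums;
        }

        public static void main(String[] args) {
            byte[] block = new byte[64 * 1024];  // stand-in for a data block
            long[] before = checksum(block);
            block[30000] ^= 1;                   // simulate a single-bit error
            long[] after = checksum(block);
            for (int i = 0; i < before.length; i++)
                if (before[i] != after[i])
                    System.out.println("corruption confined to chunk " + i);
        }
    }

Running it reports exactly one changed chunk (the one holding the
flipped byte), so a verifier need only re-read 512 bytes to confirm the
damage, not the whole block.
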
On 6/24/11 9:50 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> A smaller checksum interval decreases the overhead for random access.
> If one seeks to a random location, one must, on average, read and
> checksum an extra checksumInterval/2 bytes.  512 was chosen as a value
> that, with four-byte CRC32, reduced the impact on small seeks while
> increasing the storage and transmission overheads by less than 1%.
>
> Increasing the interval would not likely reduce the computation
> significantly, since the same number of bytes is checksummed
> regardless, but raising it to 8k or larger might optimize i/o
> operations in some cases without harming random access much.
>
> Doug
>
> On 06/24/2011 04:24 PM, Praveen Sripati wrote:
>>
>> Hi,
>>
>> Why is the checksum computed for every io.bytes.per.checksum bytes
>> (defaults to 512) rather than over the complete block at once
>> (dfs.block.size defaults to 67108864)? If a block is corrupt, the
>> entire block has to be replicated anyway. Isn't it more efficient to
>> checksum the complete block at once?
>>
>
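
To put numbers to Doug's figures: four bytes of CRC32 per 512 bytes of
data is 4/512, roughly 0.78% storage and transmission overhead, under
his 1% bound. On a random seek the reader must back up to the start of
the enclosing checksum chunk, an extra checksumInterval/2 bytes on
average: 256 bytes at the 512-byte default, versus roughly 32 MB if a
single checksum covered a whole 64 MB block. Cheap verification on
every seek matters far more than the rare corrupt-block case, where the
block is re-replicated whole regardless of checksum granularity.

For anyone wanting to try the larger interval Doug mentions, the knob
in 0.20-era configurations is io.bytes.per.checksum (documented as not
to exceed io.file.buffer.size); a sketch:

    <!-- core-site.xml: raise the checksum interval from 512 bytes to 8 KB -->
    <property>
      <name>io.bytes.per.checksum</name>
      <value>8192</value>
    </property>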