HDFS dev mailing list: Number of bytes per checksum


Re: Number of bytes per checksum
Computing CRC32 over a huge data block also reduces its error detection
capability: the number of bit errors a CRC is guaranteed to detect
drops as the length of the data it covers grows.

If you need more information on this topic, this paper is a good
starting point:

http://www.ece.cmu.edu/~koopman/networks/dsn02/dsn02_koopman.pdf

Kihwal
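
For illustration, here is a minimal sketch of the chunked scheme this
thread discusses: one 4-byte CRC32 per io.bytes.per.checksum bytes of
data, which is what lets a short random read be verified without
checksumming the whole block. It uses plain java.util.zip.CRC32 rather
than Hadoop's actual checksum classes, and the class and method names
are invented for the example.

    import java.util.zip.CRC32;

    public class ChunkedCrcSketch {
        // One 4-byte CRC32 per bytesPerChecksum bytes of data,
        // mirroring io.bytes.per.checksum (512 by default in HDFS).
        static long[] checksumChunks(byte[] data, int bytesPerChecksum) {
            int nChunks = (data.length + bytesPerChecksum - 1) / bytesPerChecksum;
            long[] sums = new long[nChunks];
            CRC32 crc = new CRC32();
            for (int i = 0; i < nChunks; i++) {
                int off = i * bytesPerChecksum;
                int len = Math.min(bytesPerChecksum, data.length - off);
                crc.reset();
                crc.update(data, off, len);
                sums[i] = crc.getValue();  // only the low 32 bits are significant
            }
            return sums;
        }

        public static void main(String[] args) {
            byte[] block = new byte[64 * 1024 * 1024];  // dfs.block.size default
            long[] sums = checksumChunks(block, 512);   // io.bytes.per.checksum default
            // To verify a random read, only the chunks overlapping the
            // requested range need to be read and re-checksummed.
            System.out.println(sums.length + " chunks, "
                + (sums.length * 4) + " bytes of checksums");
        }
    }

With the 64 MB default block size this yields 131,072 chunks and 512 KB
of checksum data, i.e. the under-1% overhead Doug mentions below.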
On 6/24/11 9:50 AM, "Doug Cutting" <[EMAIL PROTECTED]> wrote:

> A smaller checksum interval decreases the overhead for random access.
> If one seeks to a random location, one must, on average, read and
> checksum an extra checksumInterval/2 bytes.  512 was chosen as a value
> that, with four-byte CRC32, reduced the impact on small seeks while
> increasing the storage and transmission overheads by less than 1%.
>
> Increasing the interval would not likely reduce the computation
> significantly, since the same number of bytes is checksummed
> regardless, but increasing it to 8k or larger might optimize I/O
> operations in some cases without harming random access much.
>
> Doug
>
> On 06/24/2011 04:24 PM, Praveen Sripati wrote:
>>
>> Hi,
>>
>> Why is the checksum computed per io.bytes.per.checksum bytes
>> (defaults to 512) instead of over the complete block at once
>> (dfs.block.size defaults to 67108864)? If a block is corrupt then the
>> entire block has to be replicated anyway. Isn't it more efficient to
>> checksum the complete block at once?
>>
>
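
To put rough numbers on the trade-off Doug describes (simple
arithmetic, assuming his figures): with a 4-byte CRC32 per 512-byte
chunk, the storage and transmission overhead is 4/512 = 0.78%, just
under 1%, and a random seek reads and checksums an extra 512/2 = 256
bytes on average. At an 8k interval the overhead drops to 4/8192 =
0.05% and reads can be issued in fewer, larger I/O operations, but each
random seek then costs an extra 4k of reading and checksumming on
average.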