Understanding compression in hdfs


Re: Understanding compression in hdfs
Is the checksum algorithm pluggable? CRC-32 is good for error detection, not
for duplicate detection, and duplicate detection is what I need it for.

On Sun, Jul 29, 2012 at 8:41 PM, Brock Noland <[EMAIL PROTECTED]> wrote:

> Also note that HDFS already does checksums which I believe you
> can retrieve:
>
>
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
>
> http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity
>
> Brock
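A minimal sketch of what retrieving that built-in checksum might look like (the
class name and path are made up for illustration, and the checksum HDFS returns
is a CRC-based composite, not SHA-1):

// Sketch (untested): ask HDFS for the checksum it already keeps for a file,
// per the links above. The path argument is a made-up example.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.StringUtils;

public class ShowHdfsChecksum {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);              // e.g. /user/yaron/data.seq (hypothetical)

        FileChecksum checksum = fs.getFileChecksum(file);
        if (checksum == null) {
            System.out.println("No checksum exposed for " + file);
        } else {
            // On HDFS this is an MD5-of-block-CRC32s composite, not SHA-1.
            System.out.println(checksum.getAlgorithmName() + " "
                    + StringUtils.byteToHexString(checksum.getBytes()));
        }
    }
}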
>
>
> On Sun, Jul 29, 2012 at 12:35 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>
>> Thanks!
>> I'll dig into those classes to figure out my next step.
>>
>> Anyway, I just realized that block-level compression has nothing to do
>> with HDFS blocks: an HDFS block can contain an unknown number of compressed
>> blocks, which makes my efforts kind of worthless.
>>
>> thanks again!
>>
>>
>> On Sun, Jul 29, 2012 at 6:40 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
>>
>>>  What if you wrote a CompressionOutputStream class that wraps around
>>> the existing ones and outputs a hash per <n> bytes and a
>>> CompressionInputStream that checks them? ...and a Codec that wraps your
>>> compressors around arbitrary existing codecs.
>>>
>>>  Sounds like a bunch of work, and I'm not sure where you would store
>>> the hashes, but it would get the data into your clutches the instant it's
>>> available.
>>>
>>>     - Tim.
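A rough sketch of the wrapper Tim describes, written against plain java.io
rather than the actual Hadoop codec API (the class name, the side channel for
storing hashes, and the <n>-byte block size are all made up for illustration):

// Sketch: wrap an existing output stream and emit a SHA-1 digest for every
// <n> bytes that pass through. A real version would be folded into a
// CompressionOutputStream / Codec pair as suggested above, and would hash
// whole buffers instead of single bytes.
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashingOutputStream extends FilterOutputStream {
    private final MessageDigest sha1;
    private final OutputStream hashSink;   // wherever you decide to keep the hashes
    private final int blockSize;           // the <n> from Tim's mail
    private int bytesInBlock = 0;

    public HashingOutputStream(OutputStream out, OutputStream hashSink, int blockSize)
            throws NoSuchAlgorithmException {
        super(out);
        this.sha1 = MessageDigest.getInstance("SHA-1");
        this.hashSink = hashSink;
        this.blockSize = blockSize;
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        sha1.update((byte) b);
        if (++bytesInBlock == blockSize) {
            flushDigest();
        }
    }

    private void flushDigest() throws IOException {
        hashSink.write(sha1.digest());     // digest() also resets the MessageDigest
        bytesInBlock = 0;
    }

    @Override
    public void close() throws IOException {
        if (bytesInBlock > 0) {
            flushDigest();                 // hash for the final, partial block
        }
        super.close();
        hashSink.close();
    }
}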
>>>
>>> On Jul 29, 2012, at 7:41 AM, "Yaron Gonen" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>   Hi,
>>> I've created a SequenceFile.Writer with block-level compression.
>>> I'd like to compute a SHA-1 hash for each block written. How do I do that?
>>> I didn't see any way to take control of the compression in order to
>>> know when a block ends.
>>>
>>>  Thanks,
>>> Yaron
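For context, a minimal sketch of creating a block-compressed SequenceFile.Writer
like the one described above (the path, key/value types, and codec are
illustrative); the writer manages compressed-block boundaries internally, which
is why there is no obvious hook for hashing each block:

// Sketch (Hadoop 1.x-era API): write a SequenceFile with BLOCK compression.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.DefaultCodec;

public class BlockCompressedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);                 // e.g. /tmp/blocks.seq (hypothetical)

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out,
                IntWritable.class, Text.class,
                SequenceFile.CompressionType.BLOCK,   // compress whole blocks of records
                new DefaultCodec());                  // zlib; any CompressionCodec works
        try {
            for (int i = 0; i < 1000; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();                           // final partial block is flushed here
        }
        // Block boundaries are decided internally (io.seqfile.compress.blocksize),
        // so the caller never sees a "block finished" event, hence the question above.
    }
}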
>>>
>>>
>>>
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce -
> http://incubator.apache.org/mrunit/
>