HDFS >> mail # user >> Understanding compression in hdfs


Thread:
  Yaron Gonen   2012-07-29, 14:40
  Tim Broberg   2012-07-29, 15:40
  Yaron Gonen   2012-07-29, 17:35
  Brock Noland  2012-07-29, 17:41
Re: Understanding compression in hdfs
Is the checksum pluggable? CRC-32 is good for error detection, not for
duplicate detection, which is what I need it for.
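A quick plain-JDK sketch of the difference in digest width (the block contents here are just a placeholder): CRC-32 yields a 32-bit value, so among on the order of 2^16 distinct blocks accidental matches become likely by the birthday bound, whereas SHA-1's 160-bit digest makes an accidental collision negligible, which is what duplicate detection needs.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.zip.CRC32;

public class DigestWidths {
    public static void main(String[] args) throws Exception {
        byte[] block = "contents of one compressed block".getBytes(StandardCharsets.UTF_8);

        // CRC-32: 32-bit checksum, designed to catch transmission errors,
        // not to identify content uniquely.
        CRC32 crc = new CRC32();
        crc.update(block);
        System.out.printf("CRC-32: %08x (32 bits)%n", crc.getValue());

        // SHA-1: 160-bit cryptographic digest, usable as a block identity
        // for duplicate detection.
        byte[] sha1 = MessageDigest.getInstance("SHA-1").digest(block);
        System.out.println("SHA-1: " + sha1.length * 8 + " bits");
    }
}
```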

On Sun, Jul 29, 2012 at 8:41 PM, Brock Noland <[EMAIL PROTECTED]> wrote:

> Also note that HDFS already does checksums which I believe you
> can retrieve:
>
>
> http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
>
> http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity
>
> Brock
>
>
> On Sun, Jul 29, 2012 at 12:35 PM, Yaron Gonen <[EMAIL PROTECTED]> wrote:
>
>> Thanks!
>> I'll dig into those classes to figure out my next step.
>>
>> Anyway, I just realized that block-level compression has nothing to do
>> with HDFS blocks. An HDFS block can contain any number of compressed
>> blocks, which makes my efforts kind of worthless.
>>
>> thanks again!
>>
>>
>> On Sun, Jul 29, 2012 at 6:40 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
>>
>>>  What if you wrote a CompressionOutputStream class that wraps around
>>> the existing ones and outputs a hash per <n> bytes and a
>>> CompressionInputStream that checks them? ...and a Codec that wraps your
>>> compressors around arbitrary existing codecs.
>>>
>>>  Sounds like a bunch of work, and I'm not sure where you would store
>>> the hashes, but it would get the data into your clutches the instant it's
>>> available.
>>>
>>>     - Tim.
>>>
>>> On Jul 29, 2012, at 7:41 AM, "Yaron Gonen" <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>   Hi,
>>> I've created a SequenceFile.Writer with block-level compression.
>>> I'd like to create a SHA1 hash for each block written. How do I do that?
>>> I didn't see any way to hook into the compression in order to
>>> know when a block ends.
>>>
>>>  Thanks,
>>> Yaron
>>>
>>>
>>> ------------------------------
>>> The information contained in this email is intended only for the
>>> personal and confidential use of the recipient(s) named above. The
>>> information and any attached documents contained in this message may be
>>> Exar confidential and/or legally privileged. If you are not the intended
>>> recipient, you are hereby notified that any review, use, dissemination or
>>> reproduction of this message is strictly prohibited and may be unlawful. If
>>> you have received this communication in error, please notify us immediately
>>> by return email and delete the original message.
>>>
>>
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce -
> http://incubator.apache.org/mrunit/
>
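Tim's suggestion above can be sketched with plain JDK streams, with no Hadoop on the classpath: a FilterOutputStream that records one SHA-1 digest per n uncompressed bytes while forwarding the data to an inner DeflaterOutputStream standing in for Hadoop's CompressionOutputStream. The class name, the window size, and the in-memory digest list are all illustrative choices; as Tim notes, a real implementation would still need to decide where the hashes are persisted.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.DeflaterOutputStream;

/**
 * Sketch of the wrapper Tim describes: records one SHA-1 digest per
 * `window` uncompressed bytes while passing the data downstream.
 * DeflaterOutputStream stands in for Hadoop's CompressionOutputStream.
 */
class HashingOutputStream extends FilterOutputStream {
    private final MessageDigest md;
    private final int window;            // n: bytes covered by each hash
    private int inWindow = 0;            // bytes hashed so far in this window
    final List<byte[]> digests = new ArrayList<>();

    HashingOutputStream(OutputStream out, int window) throws NoSuchAlgorithmException {
        super(out);
        this.md = MessageDigest.getInstance("SHA-1");
        this.window = window;
    }

    @Override
    public void write(int b) throws IOException {
        md.update((byte) b);
        if (++inWindow == window) {      // window full: emit a digest, reset
            digests.add(md.digest());    // digest() also resets the MessageDigest
            inWindow = 0;
        }
        out.write(b);
    }

    @Override
    public void close() throws IOException {
        if (inWindow > 0) {
            digests.add(md.digest());    // hash of the final partial window
        }
        super.close();
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        HashingOutputStream h =
                new HashingOutputStream(new DeflaterOutputStream(sink), 1024);
        h.write(new byte[4096]);         // four 1 KiB windows
        h.close();
        System.out.println(h.digests.size() + " digests, "
                + sink.size() + " compressed bytes");
    }
}
```

Because the hasher wraps the compressor, the digests cover the pre-compression bytes; swapping the nesting order would hash the compressed stream instead.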