-
Understanding compression in hdfs
Yaron Gonen 2012-07-29, 14:40
Hi,
I've created a SequenceFile.Writer with block-level compression. I'd like to create a SHA-1 hash for each block written. How do I do that? I didn't see any way to take control of the compression in order to know when a block ends.
Thanks,
Yaron
-
Re: Understanding compression in hdfs
Tim Broberg 2012-07-29, 15:40
What if you wrote a CompressionOutputStream class that wraps around the existing ones and outputs a hash per <n> bytes and a CompressionInputStream that checks them? ...and a Codec that wraps your compressors around arbitrary existing codecs.
Sounds like a bunch of work, and I'm not sure where you would store the hashes, but it would get the data into your clutches the instant it's available.
- Tim.
________________________________
The information contained in this email is intended only for the personal and confidential use of the recipient(s) named above. The information and any attached documents contained in this message may be Exar confidential and/or legally privileged. If you are not the intended recipient, you are hereby notified that any review, use, dissemination or reproduction of this message is strictly prohibited and may be unlawful. If you have received this communication in error, please notify us immediately by return email and delete the original message.
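A minimal sketch of the "hash per <n> bytes" idea from this reply, using only the JDK. It is a plain java.io.FilterOutputStream, not a Hadoop CompressionOutputStream, and the names ChunkHashingOutputStream, hashChunks and chunkSize are hypothetical. Since the thread leaves open where the hashes would be stored, this version simply keeps them in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical wrapper: passes bytes through unchanged and records one SHA-1 per chunkSize bytes. */
class ChunkHashingOutputStream extends FilterOutputStream {
    private final int chunkSize;
    private final MessageDigest digest;
    private final List<String> hashes = new ArrayList<>();
    private int bytesInChunk = 0;

    ChunkHashingOutputStream(OutputStream out, int chunkSize) {
        super(out);
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        this.chunkSize = chunkSize;
        try {
            this.digest = MessageDigest.getInstance("SHA-1");
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        digest.update((byte) b);
        if (++bytesInChunk == chunkSize) {
            finishChunk();
        }
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Split the incoming buffer so that no chunk ever exceeds chunkSize bytes.
        while (len > 0) {
            int n = Math.min(len, chunkSize - bytesInChunk);
            out.write(buf, off, n);
            digest.update(buf, off, n);
            bytesInChunk += n;
            off += n;
            len -= n;
            if (bytesInChunk == chunkSize) {
                finishChunk();
            }
        }
    }

    @Override
    public void close() throws IOException {
        // A trailing partial chunk still gets its own hash.
        if (bytesInChunk > 0) {
            finishChunk();
        }
        super.close();
    }

    /** Hex SHA-1 of every completed chunk, in write order. */
    List<String> getHashes() {
        return Collections.unmodifiableList(hashes);
    }

    private void finishChunk() {
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) { // digest() also resets the digest for the next chunk
            hex.append(String.format("%02x", b & 0xff));
        }
        hashes.add(hex.toString());
        bytesInChunk = 0;
    }

    /** Convenience helper (hypothetical): hash an in-memory buffer chunk by chunk. */
    static List<String> hashChunks(byte[] data, int chunkSize) {
        ChunkHashingOutputStream s =
                new ChunkHashingOutputStream(new ByteArrayOutputStream(), chunkSize);
        try {
            s.write(data);
            s.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return s.getHashes();
    }

    public static void main(String[] args) {
        byte[] data = "abcabcab".getBytes(StandardCharsets.US_ASCII);
        for (String h : hashChunks(data, 3)) {
            System.out.println(h);
        }
    }
}
```

Where the wrapper sits decides what gets hashed: placed around the stream the compressor writes to, it sees compressed bytes; placed in front of the compressor, it sees the original data. Fixed-size chunks also do not line up with the compressor's own block boundaries, so this only approximates "one hash per compressed block".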
-
Re: Understanding compression in hdfs
Yaron Gonen 2012-07-29, 17:35
Thanks!
I'll dig into those classes to figure out my next step.
Anyway, I just realized the block-level compression has nothing to do with HDFS blocks. An HDFS block can contain an unknown number of compressed blocks, which makes my efforts kind of worthless.
Thanks again!
-
Re: Understanding compression in hdfs
Brock Noland 2012-07-29, 17:41
Also note that HDFS already does checksums which I believe you can retrieve:
http://hadoop.apache.org/common/docs/r1.0.3/api/org/apache/hadoop/fs/FileSystem.html#getFileChecksum(org.apache.hadoop.fs.Path)
http://hadoop.apache.org/common/docs/r1.0.3/hdfs_design.html#Data+Integrity
Brock
--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
-
Re: Understanding compression in hdfs
Yaron Gonen 2012-07-29, 20:23
Is the checksum pluggable? CRC-32 is good for error detection, not for duplicate detection. I need this for duplicate detection.
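The distinction drawn in this message, CRC-32 for error detection versus a cryptographic hash for duplicate detection, can be illustrated with a small JDK-only sketch. A 32-bit checksum has only about 4.3 billion possible values, so unrelated blocks start to collide by chance once a dataset holds on the order of tens of thousands of blocks, while a 160-bit SHA-1 makes accidental collisions negligible. The class BlockFingerprints and its methods are hypothetical names, and nothing here touches the HDFS checksum machinery.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

/** Hypothetical helper comparing a 32-bit checksum with a SHA-1 fingerprint. */
class BlockFingerprints {

    /** 32-bit CRC: cheap and good at catching corruption, too short to identify content. */
    static long crc32(byte[] block) {
        CRC32 crc = new CRC32();
        crc.update(block, 0, block.length);
        return crc.getValue();
    }

    /** 160-bit SHA-1 as lowercase hex, used here as the duplicate-detection key. */
    static String sha1Hex(byte[] block) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(block)) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /**
     * For each block, returns the index of the first block with the same SHA-1.
     * A block that has not been seen before maps to its own index.
     */
    static int[] firstOccurrence(List<byte[]> blocks) {
        Map<String, Integer> seen = new HashMap<>();
        int[] result = new int[blocks.size()];
        for (int i = 0; i < blocks.size(); i++) {
            Integer previous = seen.putIfAbsent(sha1Hex(blocks.get(i)), i);
            result[i] = (previous == null) ? i : previous;
        }
        return result;
    }

    public static void main(String[] args) {
        List<byte[]> blocks = new ArrayList<>();
        for (String s : new String[] {"alpha", "beta", "alpha"}) {
            blocks.add(s.getBytes(StandardCharsets.US_ASCII));
        }
        System.out.println(Arrays.toString(firstOccurrence(blocks)));
        System.out.println(Long.toHexString(crc32(blocks.get(0))));
        System.out.println(sha1Hex(blocks.get(0)));
    }
}
```

Swapping the algorithm name passed to MessageDigest.getInstance for "SHA-256" would give a longer fingerprint with no other code changes.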