On Sun, Feb 26, 2012 at 1:49 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hi Mohit,
> On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>> Thanks! Some questions I have are:
>> 1. Would it work with sequence files? I am using
> Yes, you just need to set the right codec when you write the file.
> Reading then works the same as reading a non-compressed sequence file.
> The codec classnames are stored as meta information in sequence
> files and are read back to load the right codec for the reader - thus
> you don't have to specify a 'reader' codec once you are done writing a
> file with any codec of choice.
>> 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
>> split the files?
> Yes, SequenceFiles are a natively splittable file format, designed for
> HDFS and MapReduce. Compressed sequence files are thus splittable too.
> You mostly need block compression unless your records are large in
> size and you feel you'll benefit more from compression algorithms
> applied to a single, complete record instead of a bunch of records.
>> 3. I am also using CDH's 20.2 version of hadoop.
> http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)
> Harsh J
LZO confuses most people because of how it was added to and later
removed from Hadoop. There is also a way to make raw LZO files
splittable by indexing them.
I have just patched google-snappy into 0.20.2. Snappy has a similar
performance profile to LZO: good compression with low processor
overhead. It does not have the licence issues, and there are not
thousands of semi-contradictory, confusing pieces of information about
it, so it ends up being easier to set up and use.
Recent versions of Hadoop have Snappy built in, so it will just work
out of the box.
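
On Harsh's point about BLOCK vs RECORD compression, here is a rough,
standalone sketch of why block compression usually wins when records
are small. It is only an illustration under some assumptions: it uses
the JDK's DEFLATE (java.util.zip.Deflater) as a stand-in codec, not
Snappy or LZO, and it does not touch Hadoop's SequenceFile code at all.
The class name, the helper methods and the sample log-style records are
all made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of Hadoop.
class CompressionGranularity {

    // Compress one buffer with DEFLATE and return the compressed size in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // RECORD-style: every record is compressed on its own.
    static int recordCompressedSize(List<byte[]> records) {
        int total = 0;
        for (byte[] record : records) {
            total += deflatedSize(record);
        }
        return total;
    }

    // BLOCK-style: records are concatenated and compressed in one pass,
    // so redundancy shared across records can be exploited.
    static int blockCompressedSize(List<byte[]> records) {
        ByteArrayOutputStream all = new ByteArrayOutputStream();
        for (byte[] record : records) {
            all.write(record, 0, record.length);
        }
        return deflatedSize(all.toByteArray());
    }

    // Made-up sample data: many short, similar records.
    static List<byte[]> sampleRecords(int n) {
        List<byte[]> records = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String line = "user=" + i + " action=login status=ok host=example.invalid\n";
            records.add(line.getBytes(StandardCharsets.UTF_8));
        }
        return records;
    }

    public static void main(String[] args) {
        List<byte[]> records = sampleRecords(1000);
        System.out.println("per-record: " + recordCompressedSize(records) + " bytes");
        System.out.println("per-block:  " + blockCompressedSize(records) + " bytes");
    }
}
```

With short, repetitive records like these the per-block total should
come out much smaller than the per-record total, since a single small
record gives the compressor very little to work with. With large
records the gap narrows, which is the case where RECORD compression can
be a reasonable choice.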