Hadoop >> mail # user >> LZO with sequenceFile


Re: LZO with sequenceFile
On Sun, Feb 26, 2012 at 1:49 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> Hi Mohit,
>
> On Sun, Feb 26, 2012 at 10:42 PM, Mohit Anchlia <[EMAIL PROTECTED]> wrote:
>> Thanks! Some questions I have are:
>> 1. Would it work with sequence files? I am using
>> SequenceFileAsTextInputStream
>
> Yes, you just need to set the right codec when you write the file.
> Reading then works the same as reading a non-compressed sequence file.
>
> The codec class names are stored as metadata in sequence files and
> are read back to load the right codec for the reader. Thus you don't
> have to specify a 'reader' codec once you are done writing a file
> with any codec of your choice.
>
>> 2. If I use SequenceFile.CompressionType.RECORD or BLOCK would it still
>> split the files?
>
> Yes, SequenceFiles are a natively splittable file format, designed for
> HDFS and MapReduce. Compressed sequence files are thus splittable too.
>
> You will mostly want block compression, unless your records are large
> and you feel you'll benefit more from compressing each complete record
> on its own instead of a batch of records.
>
>> 3. I am also using CDH's 20.2 version of hadoop.
>
> http://www.cloudera.com/assets/images/diagrams/whats-in-a-version.png :)
>
> --
> Harsh J
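
To illustrate the point above about the codec name being stored in the
file's metadata, here is a minimal sketch. It is a toy container built
only on java.util.zip, not the real SequenceFile layout or the Hadoop
API; the class name CodecTaggedFile, the codec labels "gzip" and
"deflate", and all method names are made up for this example. The
writer picks a codec once, and the reader finds it from the header.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

/** Toy container (hypothetical): header = codec name, body = compressed bytes. */
final class CodecTaggedFile {

    /** Writer: the caller picks the codec once; its name is recorded in the header. */
    static byte[] write(String codecName, String payload) {
        try {
            ByteArrayOutputStream file = new ByteArrayOutputStream();
            new DataOutputStream(file).writeUTF(codecName); // metadata header
            try (OutputStream body = compressor(codecName, file)) {
                body.write(payload.getBytes(StandardCharsets.UTF_8));
            }
            return file.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns the codec name recorded in the header. */
    static String codecOf(byte[] fileBytes) {
        try {
            return new DataInputStream(new ByteArrayInputStream(fileBytes)).readUTF();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reader: takes no codec argument; the codec is resolved from the header. */
    static String read(byte[] fileBytes) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(fileBytes));
            String codecName = in.readUTF();
            ByteArrayOutputStream plain = new ByteArrayOutputStream();
            try (InputStream body = decompressor(codecName, in)) {
                byte[] buf = new byte[4096];
                for (int n = body.read(buf); n != -1; n = body.read(buf)) {
                    plain.write(buf, 0, n);
                }
            }
            return new String(plain.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static OutputStream compressor(String name, OutputStream out) throws IOException {
        switch (name) {
            case "gzip":    return new GZIPOutputStream(out);
            case "deflate": return new DeflaterOutputStream(out);
            default: throw new IllegalArgumentException("unknown codec: " + name);
        }
    }

    private static InputStream decompressor(String name, InputStream in) throws IOException {
        switch (name) {
            case "gzip":    return new GZIPInputStream(in);
            case "deflate": return new InflaterInputStream(in);
            default: throw new IllegalArgumentException("unknown codec: " + name);
        }
    }

    public static void main(String[] args) {
        String text = "key1\tvalue1\nkey2\tvalue2\n";
        for (String codec : new String[] {"gzip", "deflate"}) {
            byte[] file = write(codec, text);
            System.out.println(codecOf(file) + " round trip ok: " + text.equals(read(file)));
        }
    }
}
```

Note that read() never takes a codec argument, which is the same reason
a job reading a compressed sequence file needs no codec setting of its
own.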

LZO confuses most people because of how it was added and then removed.
There is also a system for making raw LZO files splittable by indexing
them.

I have just patched google-snappy into 0.20.2. Snappy has a similar
performance profile to LZO: good compression with low processor
overhead. It does not have the licence issues, and there are not
thousands of semi-contradictory, confusing pieces of information about
it, so it ends up being easier to set up and use.

http://code.google.com/p/snappy/

Recent versions of Hadoop have Snappy built in, so it will just work
out of the box.

Edward
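
On the RECORD versus BLOCK question earlier in the thread, a rough way
to see why block compression is usually the better choice for small
records is to compare compressing each record alone against compressing
a whole batch at once. The sketch below uses only java.util.zip.Deflater
as a stand-in codec and ignores the real SequenceFile on-disk layout;
the class name RecordVsBlock, the sample record shape, and the method
names are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Deflater;

/** Hypothetical comparison of per-record and per-batch compression. */
final class RecordVsBlock {

    /** Number of bytes produced when one buffer is deflated on its own. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] chunk = new byte[4096];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(chunk);
        }
        deflater.end();
        return total;
    }

    /** "Record" style: every record is compressed independently. */
    static int recordCompressedSize(List<String> records) {
        int total = 0;
        for (String record : records) {
            total += deflatedSize(record.getBytes(StandardCharsets.UTF_8));
        }
        return total;
    }

    /** "Block" style: records are batched and compressed together. */
    static int blockCompressedSize(List<String> records) {
        StringBuilder block = new StringBuilder();
        for (String record : records) {
            block.append(record).append('\n');
        }
        return deflatedSize(block.toString().getBytes(StandardCharsets.UTF_8));
    }

    /** Small, similar-looking records (made-up sample data). */
    static List<String> sampleRecords(int count) {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            records.add("user=" + (i % 50) + "\taction=click\tpage=/home/" + i);
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> records = sampleRecords(1000);
        System.out.println("per-record bytes: " + recordCompressedSize(records));
        System.out.println("per-block bytes:  " + blockCompressedSize(records));
    }
}
```

With short records the compressor has little to work with per record,
so batching wins; with large records the gap narrows, which matches the
advice quoted above.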