Snappy vs LZO:
Implementing LZO takes several steps, starting with building the hadoop-lzo library, which we did eventually get built. Indexing had to be done as a separate step, and LZO indexing alters the way the files are stored, so they do not use Hadoop's built-in mapper. Snappy, on the other hand, comes packaged with Cloudera, and since we are using the Cloudera distribution, that made sense for us. LZO compresses better than Snappy, but that was acceptable to us because performance was better with Snappy sequence files than with LZO.
RC file vs sequence file:
We would have gone with RC file for all the reasons given below, but, as Bejoy said, sequence file is more widely used. It looks like Sqoop may support sequence files with Hive import, and since we use Sqoop a lot, sequence file is the better choice for us.
We also tested going back and forth between compression codecs and between file formats. Since that works, we can switch the compression or the file format later if we need to.
From: yongqiang he [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, June 27, 2012 12:41 AM
To: [EMAIL PROTECTED]
Subject: Re: hive - snappy and sequence file vs RC file
Can you share the reason for choosing Snappy as your compression codec?
As @omalley mentioned, RCFile will compress the data more densely, and will avoid reading data that your Hive query does not require. And I think Facebook uses it to store tens of PB (if not hundreds of PB) of data.
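To make the "avoid reading data not required" point concrete, here is a minimal Python sketch. It is not the RCFile or SequenceFile format: it assumes a made-up three-column table, uses zlib as a stand-in codec, and compresses each column as one blob, whereas the real formats have their own block structure. All names in it are hypothetical.

```python
import zlib

# Hypothetical sample table (user_id, country, status); not data from this thread.
rows = [(str(i), ("US", "DE", "IN")[i % 3], ("active", "inactive")[i % 2])
        for i in range(1000)]


def encode_row_major(rows):
    """Row-oriented block: whole records are serialized together, then compressed."""
    text = "\n".join(",".join(r) for r in rows)
    return zlib.compress(text.encode("utf-8"))


def encode_column_major(rows):
    """Column-oriented block: each column is serialized and compressed on its own."""
    columns = zip(*rows)
    return [zlib.compress("\n".join(col).encode("utf-8")) for col in columns]


def read_column_row_major(blob, index):
    """Has to decompress and split every record to recover a single column."""
    text = zlib.decompress(blob).decode("utf-8")
    values = [line.split(",")[index] for line in text.split("\n")]
    return values, len(blob)  # every compressed byte was touched


def read_column_column_major(blobs, index):
    """Decompresses only the blob that holds the requested column."""
    blob = blobs[index]
    values = zlib.decompress(blob).decode("utf-8").split("\n")
    return values, len(blob)  # only this column's bytes were touched


row_blob = encode_row_major(rows)
col_blobs = encode_column_major(rows)

row_values, row_bytes_touched = read_column_row_major(row_blob, 1)
col_values, col_bytes_touched = read_column_column_major(col_blobs, 1)

print("row-major compressed bytes:   ", len(row_blob))
print("column-major compressed bytes:", sum(len(b) for b in col_blobs))
print("bytes touched to read 'country':", row_bytes_touched, "vs", col_bytes_touched)
```

Both readers return the same values; the difference is how many compressed bytes each one has to touch to answer a single-column query. The printed sizes depend on the data, so treat them as illustration only, not as a benchmark of either format.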
On Tue, Jun 26, 2012 at 9:49 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> SequenceFile compared to RCFile:
> * More widely deployed.
> * Available from MapReduce and Pig
> * Doesn't compress as small (in RCFile all of each columns values are put
> * Uncompresses and deserializes all of the columns, even if you are only
>   reading a few
> In either case, for long term storage, you should seriously consider
> the default codec since that will provide much tighter compression (at
> the cost of cpu to compress it).
> -- Owen