You have quite a few options, depending on the space-time tradeoff you want to make.
Gzip compresses well but is CPU intensive. It is not splittable, so
parallelism and network IO suffer.
Snappy is not space efficient but is easy on the CPU (great for map output
compression). It is not splittable unless you use it within a container
format.
LZO has a good space-time balance and is used by several companies
operating Hadoop. It is fast and, once the files are indexed, splittable,
which is a major advantage: https://github.com/kevinweil/hadoop-lzo
Bzip2 compresses well and is splittable, but is CPU intensive.
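To get a feel for the tradeoff on your own data, you could time a few
settings on a sample before committing. The sketch below is only an
illustration: the JDK ships DEFLATE (the algorithm behind gzip) but not
Snappy, LZO or Bzip2, so it uses Deflater levels as a stand-in for "fast"
versus "dense" codecs. The class name CodecTradeoffSketch and the generated
sample data are made up for this example; swap in a slice of your real
files.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

// Hypothetical helper, not part of Hadoop: compares DEFLATE levels on a sample.
final class CodecTradeoffSketch {

    /** Compresses the input at the given Deflater level and returns the result. */
    static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end(); // release native zlib memory
        }
    }

    /** Builds repetitive, log-like text; an assumption standing in for real data. */
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2012-05-23 11:15:00\tmsg-").append(i % 100).append("\tstatus=OK\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(200_000);
        int[] levels = {Deflater.BEST_SPEED, Deflater.DEFAULT_COMPRESSION, Deflater.BEST_COMPRESSION};
        for (int level : levels) {
            long start = System.nanoTime();
            byte[] packed = deflate(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.printf("level=%d  in=%d bytes  out=%d bytes  time=%d us%n",
                    level, data.length, packed.length, micros);
        }
    }
}
```

The numbers depend heavily on the data and the machine, so treat them as a
rough guide rather than a benchmark of the actual Hadoop codecs.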
Based on your requirements, you could go with one of these. Makes sense?
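If gzip turns out to be good enough, a minimal sketch of the writing side
could look like the following. It writes to a local temp file so it runs
without a cluster; in your client you would presumably wrap the stream you
get from Hadoop's FileSystem.create(...) instead. The class and method
names (GzipRotationSketch, writeGzipLines, readGzipLines) are invented for
this example. As far as I know, Hadoop's text input and Pig's PigStorage
choose the codec from the file extension, so the ".gz" suffix matters.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: local-file stand-in for a client that writes gzip text.
final class GzipRotationSketch {

    /** Writes one record per line, gzip-compressed, to the target file. */
    static void writeGzipLines(Path target, List<String> lines) throws IOException {
        try (BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(
                new GZIPOutputStream(Files.newOutputStream(target)), StandardCharsets.UTF_8))) {
            for (String line : lines) {
                writer.write(line);
                writer.write('\n');
            }
        } // closing the writer also finishes the gzip trailer
    }

    /** Reads the file back, mainly to check that the output is valid gzip. */
    static List<String> readGzipLines(Path source) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(source)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // The ".gz" suffix is deliberate: codec detection is by extension.
        Path file = Files.createTempFile("messages-", ".gz");
        writeGzipLines(file, Arrays.asList("id-1\thello", "id-2\tworld"));
        System.out.println("wrote " + Files.size(file) + " bytes, read back "
                + readGzipLines(file).size() + " lines");
        Files.delete(file);
    }
}
```

Keep in mind the point above about gzip not being splittable: each
15-minute file would be read by a single mapper, which may hurt if the
files are very large.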
On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <[EMAIL PROTECTED]> wrote:
> Hi Guys,
> I am writing data to Hadoop using a Java client. The source of data for the
> Java client is messaging data. The Java client rotates files every 15
> minutes. I use PigServer to submit a MapReduce job on the just-closed file.
> These files contain data in text format and are very large. I am not using
> any compression currently but would like to explore it, as the amount of
> data is increasing day by day.
> I need to use compression while writing data to Hadoop and make Pig
> aware of this compression while submitting MapReduce jobs. I am looking
> for some guidance to understand my options.