Re: Recommendations for compression
Hi Rakesh,

You have quite a few options, depending on the space-time tradeoff you want to make.

Gzip compresses well but is CPU intensive. It is not splittable, so
parallelism and network IO suffer.
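
To make the gzip option concrete, here is a minimal sketch that uses only the JDK's java.util.zip classes. The class name GzipSketch and the sample rows are made up for illustration, and the in-memory buffer stands in for whatever output stream your client opens on HDFS. As far as I know, Pig's default loader picks up gzip input by its .gz file extension, so the rotated files would need that suffix.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: gzip text before writing it out, and read it back.
class GzipSketch {

    // Compress text into gzip bytes. In a real client the ByteArrayOutputStream
    // would be replaced by the stream opened on the target file.
    static byte[] compress(String text) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    // Decompress gzip bytes back into text, to check the round trip.
    static String decompress(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up, repetitive text rows similar in spirit to log or message data.
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            rows.append("2012-05-23\tmessage\t").append(i).append('\n');
        }
        byte[] packed = compress(rows.toString());
        System.out.println("raw bytes:  " + rows.length());
        System.out.println("gzip bytes: " + packed.length);
    }
}
```

Because a gzip file cannot be split, each rotated file would be read by a single mapper, so the 15-minute rotation interval ends up setting the unit of parallelism.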

Snappy is not space efficient but is easy on the CPU (great for map output
compression). It is not splittable unless you use it within a container
like SequenceFile.
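
For the map output case, a rough sketch of the settings involved is below. The property names are the Hadoop 1.x ones (newer releases use mapreduce.map.output.compress and mapreduce.map.output.compress.codec instead), and the class name MapOutputCompressionSketch is made up for illustration. I am assuming you would hand the resulting Properties to PigServer when you construct it; please check that against the Pig version you run.

```java
import java.util.Properties;

// Hypothetical helper that collects job properties for Snappy map output compression.
class MapOutputCompressionSketch {

    static Properties snappyMapOutput() {
        Properties props = new Properties();
        // Hadoop 1.x property names; later versions renamed them.
        props.setProperty("mapred.compress.map.output", "true");
        props.setProperty("mapred.map.output.compression.codec",
                "org.apache.hadoop.io.compress.SnappyCodec");
        return props;
    }

    public static void main(String[] args) {
        Properties props = snappyMapOutput();
        // Print the settings; in a real client these would be passed to the job
        // configuration (assumption: via the Properties given to PigServer).
        for (String name : props.stringPropertyNames()) {
            System.out.println(name + "=" + props.getProperty(name));
        }
    }
}
```

This only affects the intermediate data shuffled between map and reduce, so it does not change how the input files themselves are stored or split.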

LZO has a good space-time balance and is used by several companies
operating Hadoop. It is fast, and LZO files become splittable once they are
indexed, which is a major advantage: https://github.com/kevinweil/hadoop-lzo

Bzip2 compresses well and is splittable, but is CPU intensive.

Based on your requirements, you could go with one of these. Makes sense?
On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <[EMAIL PROTECTED]> wrote:

>
> Hi Guys,
> I am writing data to Hadoop using a Java client. The source of data for the
> Java client is messaging data. The Java client rotates files every 15 minutes.
> I use PigServer to submit a MapReduce job on the just-closed file. These
> files hold data in text format and are very large. I am not using
> any compression currently but would like to explore it, as the amount of
> data is increasing day by day.
> I need to use compression while writing data to Hadoop and make Pig
> aware of this compression while submitting MapReduce jobs. I am looking
> for some guidance to understand my options.
> Thanks,
> Rakesh