Pig >> mail # user >> Recommendations for compression


rakesh sharma 2012-05-23, 18:15
Re: Recommendations for compression
Hi Rakesh,

You have quite a few options, depending on the space-time tradeoff you want to make.

Gzip compresses well but is CPU intensive. It is not splittable, so
parallelism and network IO suffer.

Snappy is not as space efficient but is easy on the CPU (great for map
output compression). It is not splittable unless you use it within a
container format such as SequenceFile.

LZO has a good space-time balance and is used by several companies
operating Hadoop. It is fast, and LZO files are splittable once they have
been indexed, which is a major advantage in using it:
https://github.com/kevinweil/hadoop-lzo

Bzip2 compresses well and is splittable, but it is CPU intensive.

Based on your requirements, you could go with one of these. Does that make sense?
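
To get a feel for the gzip option before touching the cluster, here is a minimal sketch that uses only JDK classes, not the Hadoop API. The class name GzipSketch and the sample record layout are made up for illustration. It compresses some repetitive text records in memory and reports the sizes, so you can try it on a sample of your own messages.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class, for illustration only.
class GzipSketch {

    // Build n tab-separated text records (an assumed layout, not the real data).
    public static String sampleRecords(int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.append("2012-05-23T18:15:00\tqueue-").append(i % 10)
              .append("\tmessage ").append(i).append('\n');
        }
        return sb.toString();
    }

    // Compress a byte array with gzip, entirely in memory.
    public static byte[] gzip(byte[] raw) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(sink)) {
            out.write(raw);
        }
        return sink.toByteArray();
    }

    // Decompress gzip data back to the original bytes.
    public static byte[] gunzip(byte[] packed) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(packed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sink.write(buf, 0, n);
            }
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] raw = sampleRecords(10000).getBytes(StandardCharsets.UTF_8);
        byte[] packed = gzip(raw);
        System.out.printf("raw: %d bytes, gzip: %d bytes%n", raw.length, packed.length);
    }
}
```

In a real client you would wrap the HDFS output stream rather than a byte array. As far as I know, Pig's PigStorage picks up gzip and bzip2 files by their .gz and .bz2 extensions, but please check that against your Pig version.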
On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <[EMAIL PROTECTED]> wrote:

>
> Hi Guys,
> I am writing data to Hadoop using a Java client. The source of data for the
> Java client is messaging data. The Java client rotates files every 15 minutes.
> I use PigServer to submit a map reduce job on the just-closed file. These
> files hold data in text format and are very large. I am not using any
> compression currently but would like to explore it, as the amount of data is
> increasing day by day.
> I need to use compression while writing data to Hadoop and make Pig
> aware of this compression when submitting map reduce jobs. I am looking
> for some guidance to understand my options.
> Thanks,
> Rakesh
rakesh sharma 2012-05-23, 19:01