Re: Recommendations for compression
Prashant Kommireddi 2012-05-23, 18:28
Hi Rakesh,
You have quite a few options, depending on the space-time tradeoff you want to make.

- Gzip compresses well but is CPU intensive. It is not splittable, so parallelism and network IO suffer.
- Snappy is not space efficient but is easy on CPU (great for map output compression). It is not splittable unless you use it within a container format such as SequenceFile.
- LZO has a good space-time balance and is used by several companies operating Hadoop. It is fast and splittable (once the files are indexed), which is a major advantage. https://github.com/kevinweil/hadoop-lzo
- Bzip2 compresses well and is splittable, but is CPU intensive.

Based on your requirements, you could go with one of these. Does that make sense?

On Wed, May 23, 2012 at 11:15 AM, rakesh sharma <[EMAIL PROTECTED]> wrote:
>
> Hi Guys,
> I am writing data to Hadoop using a Java client. The source of data for the
> Java client is messaging data. The Java client rotates files every 15 minutes.
> I use PigServer to submit a map reduce job on the just-closed file. These
> files hold data in text format and are very large. I am not using
> any compression currently but would like to explore it, as the amount of data
> is increasing day by day.
> I need to use compression while writing data to Hadoop and make Pig
> aware of this compression when submitting map reduce jobs. I am looking
> for some guidance to understand my options.
> Thanks,
> Rakesh
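
One rough way to see the space-time tradeoff discussed above is to time a couple of codecs on a sample of your own data. The sketch below is only an illustration, not a Hadoop benchmark: it uses Python's standard-library gzip and bz2 modules, and it leaves out Snappy and LZO because they need third-party libraries. The function names and the sample log line are made up for the example, and the sample is so repetitive that its ratios will look far better than real message data would.

```python
import bz2
import gzip
import time


def measure(name, compress, data):
    """Compress data once; return (name, compressed_size, ratio, seconds)."""
    start = time.perf_counter()
    compressed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(compressed), len(data) / len(compressed), elapsed


def compare_codecs(data):
    """Run each standard-library codec over the same bytes."""
    codecs = [("gzip", gzip.compress), ("bzip2", bz2.compress)]
    return [measure(name, fn, data) for name, fn in codecs]


# Hypothetical sample: one invented log line repeated many times.
# Replace it with bytes read from one of your real rotated files.
sample = b"2012-05-23 18:28:00 INFO message id=12345 status=delivered\n" * 20000

for name, size, ratio, seconds in compare_codecs(sample):
    print(f"{name}: {size} bytes, ratio {ratio:.1f}x, {seconds:.3f} s")
```

Running this against a real 15-minute file gives a feel for how much CPU time each codec costs per byte saved. It says nothing about splittability, which depends on the codec and the file format used in Hadoop rather than on these numbers.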