1. These files will probably be in some standard format such as .gz, .bz2, or .zip. In that case, pick an appropriate InputFormat. See e.g. http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/, http://stackoverflow.com/questions/14497572/reading-gzipped-file-in-hadoop-using-custom-recordreader
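To illustrate the point above, here is a minimal sketch (assuming the Hadoop client libraries are on the classpath; the input path is made up) showing how Hadoop's CompressionCodecFactory resolves a codec from a file extension. TextInputFormat consults this same factory, which is why .gz and .bz2 inputs are decompressed transparently, while .zip needs a custom InputFormat like the one in the cotdp.com link:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

public class CodecCheck {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        CompressionCodecFactory factory = new CompressionCodecFactory(conf);

        // The factory picks a codec from the file extension; TextInputFormat
        // uses the same lookup, so .gz input "just works" with no custom reader.
        // The path here is purely illustrative.
        CompressionCodec codec = factory.getCodec(new Path("/data/input/logs.gz"));
        System.out.println(codec == null
                ? "no codec registered (treated as plain text)"
                : codec.getClass().getName());

        // Note: .zip is not covered by the standard registered codecs,
        // which is why a custom InputFormat is needed for ZIP archives.
    }
}
```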
2. Generally, compression is a Good Thing and will improve performance, but only if you use a fast compressor like LZO or Snappy; gzip, ZIP, BZ2, etc. are no good for this. You also need to ensure that your compressed files are "splittable" if you are going to create a single file that will be processed by a later MR stage; a SequenceFile is helpful for this. For typical intermediate outputs it matters less, because you will have a folder of file parts that are "pre-split" in some sense. Once upon a time, LZO compression had to be installed as a separate component, but I think the modern distros include it. See for example: http://kickstarthadoop.blogspot.com/2012/02/use-compression-with-mapreduce.html , http://blog.cloudera.com/blog/2009/05/10-mapreduce-tips/, http://my.safaribooksonline.com/book/software-engineering-and-development/9781449328917/compression/id3689058, https://www.inkling.com/read/hadoop-definitive-guide-tom-white-3rd/chapter-4/compression (section 4.2 in the Elephant book).
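Putting the above into a job driver might look like the following sketch. This is a configuration fragment, not a full job: the property names are the Hadoop 2.x ones (the 1.x equivalents are mapred.compress.map.output and friends), and the job name is made up:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output with a fast codec (Snappy)
        // to reduce shuffle I/O, per point 2 above.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
                      SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-example");

        // Write the final output as a block-compressed SequenceFile, so a
        // later MR stage can still split it despite the compression.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(
                job, SequenceFile.CompressionType.BLOCK);
    }
}
```

Snappy trades a slightly worse compression ratio for much lower CPU cost than gzip, which is usually the right trade for intermediate data.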
From: Geelong Yao [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 20, 2013 12:30 AM
To: [EMAIL PROTECTED]
Subject: some idea about the Data Compression
Hi, everyone
I am working on data compression:
1. Data compression before the raw data are uploaded into HDFS.
2. Data compression while processing in Hadoop, to reduce the pressure on I/O.
Can anyone give me some ideas on the above two directions?
From Good To Great