Re: Processing compressed files in Hadoop
bejoy.hadoop@... 2012-02-08, 17:53
Hi Leo
Irrespective of the input/output format in your MapReduce job, you can get compressed output by setting the following parameters:

mapred.output.compress=true
mapred.output.compression.codec=yourCompressionCodec.class

LZO is better, I guess. Now, if you want to index LZO, it is straightforward, and you need to enable:

io.compression.codecs=<append the codec class>

If you have these enabled, just use any available InputFormat as you would with an uncompressed file. No specific input format is required for your MR jobs on compressed data.

Hope it helps!

Regards
Bejoy K S

From handheld, please excuse typos.

-----Original Message-----
From: Leonardo Urbina <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Wed, 8 Feb 2012 12:39:54
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Processing compressed files in Hadoop

Hello everyone,

I run a daily job that takes files in a variety of formats and processes them using several custom InputFormats, which are specified using MultipleInputs. The results get aggregated into a single SequenceFile. Furthermore, this SequenceFile is used as part of the input for the next day's job. I run all of this in Amazon's EMR.

Now, I would like to use compression in order to save on storage; however, after looking around online I have hit some dead ends:

1) I would like to compress my input files, and Hadoop gives me three choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as they cannot be made splittable. LZO, on the other hand, can be indexed; however, as far as I could tell, I would be forced to use LzoTextInputFormat in order to get Hadoop to properly decompress and read the files. Most of my input cannot use TextInputFormat (my inputs include multi-line records and XML files, among other things). My question is: is it possible to use LZO with custom InputFormats?

2) I am also interested in compressing the output SequenceFile.
I know this can be done by setting FileOutputFormat.setCompressOutput(conf, true). If I were using TextOutputFormat, the output would be a gzipped text file. However, being a SequenceFile, it seems to be internally compressed, and the compression scheme is not immediately apparent to me. Is it possible to specify LZO as the compression? Also, since I will be using the output as part of the next input, do I need to index the output as a separate task? And finally, when I specify the input format for the next day (and this goes back to my first question), what InputFormat should I specify? I haven't been able to find something like LzoSequenceInputFormat or anything of the like. Am I missing something?

Any help would be greatly appreciated.

Best,
-Leo

--
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[EMAIL PROTECTED]
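
For reference, the settings discussed in this thread could be written out as Hadoop configuration properties roughly as below. This is a sketch under assumptions, not a tested EMR setup. It assumes the old `mapred.*` property names in use in early 2012, and it assumes the separately installed hadoop-lzo library, which provides `com.hadoop.compression.lzo.LzoCodec` and `com.hadoop.compression.lzo.LzopCodec`. The `mapred.output.compression.type` property is an addition that the thread itself does not mention.

```xml
<!-- Sketch only: the LZO class names assume hadoop-lzo is installed on every node. -->
<configuration>
  <!-- Register the codecs; the two LZO entries are appended to the stock list. -->
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
  </property>
  <!-- Turn on compression of the job output. -->
  <property>
    <name>mapred.output.compress</name>
    <value>true</value>
  </property>
  <!-- Codec used for the job output (assumed here to be LZO). -->
  <property>
    <name>mapred.output.compression.codec</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  <!-- Assumption: block-level compression for SequenceFile output. -->
  <property>
    <name>mapred.output.compression.type</name>
    <value>BLOCK</value>
  </property>
</configuration>
```

A SequenceFile records its codec in the file header and is compressed per record or per block, so the plain SequenceFileInputFormat should be able to read and split such output without a separate LZO indexing step, provided the codec is available on the cluster.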