Re: Processing compressed files in Hadoop
bejoy.hadoop@... 2012-02-08, 17:53
Irrespective of the input/output format of your MapReduce job, you can get compressed output by setting the following parameters
LZO is better I guess.
Now, if you want to index LZO, it is straightforward; you need to enable
io.compression.codecs = <append the codec class>
If you have these enabled, just use any available InputFormat as you would with an uncompressed file. No specific input format is required for your MR jobs on compressed data.
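As a rough illustration of what "append the codec class" means: the codec property holds a comma-separated list of class names, and the new codec goes at the end of it. The sketch below uses only the Java standard library and does not touch Hadoop itself. The helper name appendCodec and the class CodecList are made up for this example, and com.hadoop.compression.lzo.LzoCodec is assumed to be the codec class shipped by the hadoop-lzo package installed on your cluster; check the name against your installation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class CodecList {

    // Hypothetical helper: returns the comma-separated codec list with
    // codecClass appended, unless it is already present in the list.
    public static String appendCodec(String current, String codecClass) {
        Set<String> codecs = new LinkedHashSet<>();
        if (current != null) {
            for (String name : current.split(",")) {
                String trimmed = name.trim();
                if (!trimmed.isEmpty()) {
                    codecs.add(trimmed);
                }
            }
        }
        codecs.add(codecClass);
        return String.join(",", codecs);
    }

    public static void main(String[] args) {
        // Example starting value; the real list on a cluster may differ.
        String existing = "org.apache.hadoop.io.compress.GzipCodec,"
                + "org.apache.hadoop.io.compress.DefaultCodec";
        // Assumed LZO codec class name (from hadoop-lzo).
        String updated = appendCodec(existing, "com.hadoop.compression.lzo.LzoCodec");
        System.out.println("io.compression.codecs=" + updated);
    }
}
```

The resulting value is what would go into the cluster or job configuration; where exactly you set it depends on your setup.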
Hope it helps!
Bejoy K S
From handheld, Please excuse typos.
From: Leonardo Urbina <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Wed, 8 Feb 2012 12:39:54
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Processing compressed files in Hadoop
I run a daily job that takes files in a variety of different formats and
processes them using several custom InputFormats which are specified using
MultipleInputs. The results get aggregated into a single SequenceFile.
Furthermore this SequenceFile is used as part of the input for the next
day's job. I run all of this in Amazon's EMR. Now, I would like to be able
to use compression in order to save on storage; however, after looking
around online I have hit some dead ends:
1) I would like to compress my input files, and Hadoop gives me three
choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
they cannot be made splittable. LZO, on the other hand, can be indexed;
however, as far as I could tell, I would be forced to use LzoTextInputFormat
in order to get Hadoop to properly decompress and read the files. Most of
my input cannot use TextInputFormat (my inputs include multi-line records,
XML files, among other things). My question is, is it possible to use LZO
with custom InputFormats?
2) I am also interested in compressing the output SequenceFile. I know this
can be done by setting
If I were using TextOutputFormat, the output would be a gzipped text file.
However, being a SequenceFile it seems to be internally compressed and the
compression scheme is not immediately apparent to me. Is it possible to
specify LZO as the compression? Also, since I will be using the output as
part of the next input, do I need to index the output as a separate task?
And finally, when I specify the input format for the next day (and this
goes back to my first question), what InputFormat should I specify? I
haven't been able to find something like LzoSequenceInputFormat or anything
of the like.
Am I missing something? Any help would be greatly appreciated. Best,
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics