Re: Processing compressed files in Hadoop
Hi Leo
      Irrespective of the input/output format in your MapReduce job, you can get compressed output by setting the following parameters

LZO is better I guess.

Now if you want to index LZO it is straightforward, and you need to enable

io.compression.codecs = <append the codec class>
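
To illustrate what "append the codec class" could look like: io.compression.codecs holds a comma-separated list of codec class names, and the LZO classes are added to whatever is already configured. The sketch below is plain Java with no Hadoop dependency. The class CodecList and the helper appendCodec are hypothetical names, and the two LZO class names assume the hadoop-lzo library is installed on the cluster.

```java
import java.util.ArrayList;
import java.util.List;

public class CodecList {
    // Append a codec class name to a comma-separated codec list,
    // leaving the list unchanged if the class is already present.
    public static String appendCodec(String current, String codecClass) {
        List<String> codecs = new ArrayList<>();
        if (current != null) {
            for (String c : current.split(",")) {
                String name = c.trim();
                if (!name.isEmpty()) {
                    codecs.add(name);
                }
            }
        }
        if (!codecs.contains(codecClass)) {
            codecs.add(codecClass);
        }
        return String.join(",", codecs);
    }

    public static void main(String[] args) {
        // Example starting value; a real cluster may list other codecs.
        String existing = "org.apache.hadoop.io.compress.GzipCodec,"
                        + "org.apache.hadoop.io.compress.DefaultCodec";
        // Assumed class names from hadoop-lzo.
        String updated = appendCodec(existing, "com.hadoop.compression.lzo.LzoCodec");
        updated = appendCodec(updated, "com.hadoop.compression.lzo.LzopCodec");
        System.out.println("io.compression.codecs=" + updated);
    }
}
```

The resulting string is what would go into the property value, whether it is set in the cluster configuration or on the job.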

If you have these enabled, just use any available InputFormat as you would with an uncompressed file. No specific input format is required for your MR jobs on compressed data.
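
The reason this can work is that readers such as the one behind TextInputFormat choose a codec from the file name suffix, among the codecs that are registered. Here is a toy sketch of that kind of lookup in plain Java. It is not Hadoop's actual code: the class SuffixCodecLookup, the method codecFor, and the suffix table are illustrative assumptions, and the .lzo entry assumes hadoop-lzo is installed. A custom InputFormat with its own record reader would presumably need to do an equivalent lookup itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SuffixCodecLookup {
    // Illustrative suffix-to-codec table; not read from any real configuration.
    private static final Map<String, String> BY_SUFFIX = new LinkedHashMap<>();
    static {
        BY_SUFFIX.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        BY_SUFFIX.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        BY_SUFFIX.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
        BY_SUFFIX.put(".lzo", "com.hadoop.compression.lzo.LzopCodec"); // assumes hadoop-lzo
    }

    // Return the codec class name whose suffix matches the file name,
    // or null when the file looks uncompressed.
    public static String codecFor(String fileName) {
        for (Map.Entry<String, String> entry : BY_SUFFIX.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("part-00000.lzo"));
        System.out.println(codecFor("part-00000"));
    }
}
```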

Hope it helps!
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Leonardo Urbina <[EMAIL PROTECTED]>
Date: Wed, 8 Feb 2012 12:39:54
Subject: Processing compressed files in Hadoop

Hello everyone,

I run a daily job that takes files in a variety of different formats and
processes them using several custom InputFormats which are specified using
MultipleInputs. The results get aggregated into a single SequenceFile.
Furthermore this SequenceFile is used as part of the input for the next
day's job. I run all of this in Amazon's EMR. Now, I would like to be able
to use compression in order to save on storage; however, after looking
around online I have hit some dead ends:

1) I would like to compress my input files, and Hadoop gives me three
choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
they cannot be made splittable. LZO, on the other hand, can be indexed;
however, as far as I could tell, I would be forced to use LzoTextInputFormat
in order to get Hadoop to properly decompress and read the files. Most of
my input cannot use TextInputFormat (my inputs include multi-line records,
XML files, among other things). My question is, is it possible to use LZO
with custom InputFormats?

2) I am also interested in compressing the output SequenceFile. I know this
can be done by setting

FileOutputFormat.setCompressOutput(conf, true)

If I were using TextOutputFormat, the output would be a gzipped text file.
However, being a SequenceFile it seems to be internally compressed and the
compression scheme is not immediately apparent to me. Is it possible to
specify LZO as the compression? Also, since I will be using the output as
part of the next input, do I need to index the output as a separate task?
And finally, when I specify the input format for the next day (and this
goes back to my first question), what InputFormat should I specify? I
haven't been able to find something like LzoSequenceInputFormat or anything
of the like.

Am I missing something? Any help would be greatly appreciated. Best,

Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics