Hadoop >> mail # user >> Processing compressed files in Hadoop


Re: Processing compressed files in Hadoop
Hi Leo
      Irrespective of the input/output formats in your MapReduce job, you can get compressed output by setting the following parameters:
mapred.output.compress=true
mapred.output.compression.codec=<your compression codec class>
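For example, in the job configuration XML this would look something like the following (the LzoCodec class name here assumes the hadoop-lzo package; substitute whichever codec you actually use):

```xml
<!-- mapred-site.xml or per-job configuration: compress job output -->
<property>
  <name>mapred.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<!-- for SequenceFile output, BLOCK compression generally gives the best ratio -->
<property>
  <name>mapred.output.compression.type</name>
  <value>BLOCK</value>
</property>
```

The same properties can be passed on the command line with -D if your driver goes through ToolRunner.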

LZO is probably the better choice here, I'd guess.

Indexing LZO files is straightforward; you just need to register the codec as well:

io.compression.codecs=<append your codec class to the existing list>

Once these are enabled, just use any available InputFormat as you would with an uncompressed file. No special input format is required for your MR jobs on compressed data.
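To be concrete, registering the codec usually means appending it to the io.compression.codecs list in core-site.xml, something like the following (the com.hadoop.compression.lzo class names assume the hadoop-lzo package):

```xml
<!-- core-site.xml: register the LZO codecs alongside the defaults -->
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
</property>
<property>
  <name>io.compression.codec.lzo.class</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
```

With this in place, record readers that resolve codecs through CompressionCodecFactory will pick the codec by file extension and decompress transparently.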

Hope it helps!
Regards
Bejoy K S

Sent from handheld; please excuse typos.

-----Original Message-----
From: Leonardo Urbina <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Wed, 8 Feb 2012 12:39:54
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Processing compressed files in Hadoop

Hello everyone,

I run a daily job that takes files in a variety of different formats and
processes them using several custom InputFormats, which are specified using
MultipleInputs. The results get aggregated into a single SequenceFile.
Furthermore, this SequenceFile is used as part of the input for the next
day's job. I run all of this in Amazon's EMR. Now, I would like to use
compression in order to save on storage, but after looking around online I
have hit some dead ends:

1) I would like to compress my input files, and Hadoop gives me three
choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
they cannot be made splittable. LZO on the other hand can be indexed,
however as far as I could tell, I would be forced to use LzoTextInputFormat
in order to get Hadoop to properly decompress and read the files. Most of
my input cannot use TextInputFormat (my inputs include multi-line records,
XML files, among other things). My question is, is it possible to use LZO
with custom InputFormats?

2) I am also interested in compressing the output SequenceFile. I know this
can be done by setting

FileOutputFormat.setCompressOutput(conf, true)

If I were using TextOutputFormat, the output would be a gzipped text file.
However, being a SequenceFile it seems to be internally compressed and the
compression scheme is not immediately apparent to me. Is it possible to
specify LZO as the compression? Also, since I will be using the output as
part of the next input, do I need to index the output as a separate task?
And finally, when I specify the input format for the next day (and this
goes back to my first question), what InputFormat should I specify? I
haven't been able to find something like LzoSequenceInputFormat or anything
of the like.

Am I missing something? Any help would be greatly appreciated. Best,
-Leo

--
Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[EMAIL PROTECTED]
