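You can index the LZO files as follows:
// Run the LZO indexer on files in HDFS (LzoIndexer comes from hadoop-lzo; fs is your FileSystem)
LzoIndexer indexer = new LzoIndexer(fs.getConf());
indexer.index(lzoPath); // lzoPath is a placeholder for the Path you want to index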
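
On why the index makes LZO splittable: as far as I understand it, the index stored next to each .lzo file is essentially a sorted list of byte offsets at which compressed blocks begin, and an index-aware InputFormat moves every split boundary onto one of those offsets. Below is a simplified, self-contained sketch of that alignment step. It is not hadoop-lzo's actual code; the class name, the method name, and the offsets are made up for illustration.

```java
import java.util.Arrays;

public class LzoSplitSketch {
    // Returns the first block offset that is >= pos, or -1 if pos lies
    // beyond the start of the last block. blockOffsets must be sorted.
    public static long alignToBlock(long[] blockOffsets, long pos) {
        int i = Arrays.binarySearch(blockOffsets, pos);
        if (i < 0) {
            i = -i - 1; // not found: convert to the insertion point
        }
        return i < blockOffsets.length ? blockOffsets[i] : -1L;
    }

    public static void main(String[] args) {
        // Hypothetical block start offsets, as an index might list them.
        long[] offsets = {0L, 70_000L, 141_500L, 209_300L};
        // A split nominally starting at byte 100000 is moved forward
        // to the next block boundary.
        System.out.println(alignToBlock(offsets, 100_000L)); // prints 141500
    }
}
```

Because the alignment depends only on the index and not on the record format, the same idea should carry over to a custom InputFormat, provided its getSplits() consults the index and its RecordReader reads through the LZO codec.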
On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
> Leo, splittable bzip is available
> ...in versions > 0.21 - https://issues.apache.org/jira/browse/HADOOP-4012
> ...or as a patch for 1.0.0, to be included in 1.1.0 -
> There is a 48-bit signature in the bzip2 block header, and the reader searches
> for it at all bit alignments.
> It's not fast, but it's there.
> - Tim.
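
To make the bit-alignment search concrete, here is a minimal sketch of scanning a buffer for the 48-bit bzip2 block signature (0x314159265359) one bit at a time. It only illustrates the technique Tim describes and is not the code from HADOOP-4012; the class and method names are made up, and it assumes the bits are packed most-significant-bit first, as bzip2 does.

```java
import java.nio.ByteBuffer;

public class BlockMagicScan {
    // 48-bit signature that starts every bzip2 compressed block.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    // Returns the bit offset at which the first signature begins,
    // or -1 if the buffer does not contain one.
    public static long findBlockStart(byte[] data) {
        long window = 0L; // holds the most recent 48 bits seen
        long totalBits = (long) data.length * 8;
        for (long bit = 0; bit < totalBits; bit++) {
            int b = data[(int) (bit >>> 3)] & 0xFF;
            int next = (b >>> (7 - (int) (bit & 7))) & 1; // MSB first
            window = ((window << 1) | next) & MASK_48;
            if (bit >= 47 && window == BLOCK_MAGIC) {
                return bit - 47;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        // Three zero bits, then the signature, then 13 zero bits of padding.
        byte[] data = ByteBuffer.allocate(8).putLong(BLOCK_MAGIC << 13).array();
        System.out.println(findBlockStart(data)); // prints 3
    }
}
```

Every bit of the input passes through the shift-and-compare loop, which is why locating a split point this way costs more than jumping to a byte offset taken from an index.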
> From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of
> Leonardo Urbina [[EMAIL PROTECTED]]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: [EMAIL PROTECTED]
> Subject: Processing compressed files in Hadoop
> Hello everyone,
> I run a daily job that takes files in a variety of different formats and
> processes them using several custom InputFormats, which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> to use compression in order to save on storage; however, after looking
> around online, I have hit some dead ends:
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO, on the other hand, can be indexed;
> however, as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
> FileOutputFormat.setCompressOutput(conf, true)
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, being a SequenceFile it seems to be internally compressed and the
> compression scheme is not immediately apparent to me. Is it possible to
> specify LZO as the compression? Also, since I will be using the output as
> part of the next input, do I need to index the output as a separate task?
> And finally, when I specify the input format for the next day (and this
> goes back to my first question), what InputFormat should I specify? I
> haven't been able to find something like LzoSequenceInputFormat or anything
> of the sort.
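
On the SequenceFile question: as far as I know, a SequenceFile records its codec class in the file header and carries its own sync markers, so a block-compressed SequenceFile stays splittable without a separate index, and the plain SequenceFileInputFormat can read it back whatever codec was used. The sketch below just collects the output settings I would try, using the Hadoop 0.20/1.x property names. It assumes hadoop-lzo and its native library are installed on the cluster; the class and method names here are made up, and the codec is worth checking against your EMR setup.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class SequenceFileLzoSettings {
    // Job properties for LZO block compression of SequenceFile output.
    // The codec class comes from hadoop-lzo, which is assumed to be
    // installed on every node.
    public static Map<String, String> outputCompressionSettings() {
        Map<String, String> props = new LinkedHashMap<>();
        props.put("mapred.output.compress", "true");
        props.put("mapred.output.compression.type", "BLOCK");
        props.put("mapred.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
        return props;
    }

    public static void main(String[] args) {
        // In a real job, copy each entry into the job Configuration,
        // for example with conf.set(key, value).
        outputCompressionSettings()
                .forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

The same settings can be applied through the API with SequenceFileOutputFormat.setOutputCompressionType and FileOutputFormat.setOutputCompressorClass, alongside the setCompressOutput call quoted above.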
> Am I missing something? Any help would be greatly appreciated. Best,
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> [EMAIL PROTECTED]