Re: Processing compressed files in Hadoop
Bejoy Ks 2012-02-08, 18:33
Hi Leo
You can index the LZO files as follows:

    // LzoIndexer comes from the hadoop-lzo library
    import com.hadoop.compression.lzo.LzoIndexer;

    // Run the LZO indexer on files in HDFS
    LzoIndexer indexer = new LzoIndexer(fs.getConf());
    indexer.index(filePath);

Regards
Bejoy.K.S

On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
> Leo, splittable bzip is available
> ...in versions > 0.21 - https://issues.apache.org/jira/browse/HADOOP-4012
> ...or as a patch for 1.0.0, to be included in 1.1.0 -
> https://issues.apache.org/jira/browse/HADOOP-7823
>
> There is a 48-bit signature in the bzip header, and they search for this
> at all bit alignments.
>
> It's not fast, but it's there.
>
> - Tim.
>
> ________________________________________
> From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of
> Leonardo Urbina [[EMAIL PROTECTED]]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: [EMAIL PROTECTED]
> Subject: Processing compressed files in Hadoop
>
> Hello everyone,
>
> I run a daily job that takes files in a variety of different formats and
> processes them using several custom InputFormats which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> to use compression in order to save on storage, however after looking
> around online I have hit some dead ends:
>
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO on the other hand can be indexed,
> however as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
>
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
>
> FileOutputFormat.setCompressOutput(conf, true)
>
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, being a SequenceFile it seems to be internally compressed and the
> compression scheme is not immediately apparent to me. Is it possible to
> specify LZO as the compression? Also, since I will be using the output as
> part of the next input, do I need to index the output as a separate task?
> And finally, when I specify the input format for the next day (and this
> goes back to my first question), what InputFormat should I specify? I
> haven't been able to find something like LzoSequenceInputFormat or anything
> of the like.
>
> Am I missing something? Any help would be greatly appreciated. Best,
> -Leo
>
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> [EMAIL PROTECTED]
>
> The information and any attached documents contained in this message
> may be confidential and/or legally privileged. The message is
> intended solely for the addressee(s). If you are not the intended
> recipient, you are hereby notified that any use, dissemination, or
> reproduction is strictly prohibited and may be unlawful. If you are
> not the intended recipient, please contact the sender immediately by
> return e-mail and destroy all copies of the original message.
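
Editorial illustration of Tim's remark that the 48-bit bzip signature is searched for "at all bit alignments": the sketch below slides a 48-bit window over a byte array one bit at a time and reports every bit offset where the bzip2 block-header magic 0x314159265359 appears. It is a simplified stand-alone example, not the code from HADOOP-4012 or HADOOP-7823; the class name BzipMagicScan and the helper withMagicAt are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hadoop.
final class BzipMagicScan {
    // 48-bit bzip2 block-header magic (the BCD digits of pi).
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the starting bit offset of every occurrence of the magic in data. */
    static List<Long> findBlockMagic(byte[] data) {
        List<Long> hits = new ArrayList<>();
        long window = 0L;   // the most recent 48 bits read
        long bitsSeen = 0L; // total number of bits consumed so far
        for (byte b : data) {
            // bzip2 packs bits most-significant bit first.
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsSeen++;
                if (bitsSeen >= 48 && window == BLOCK_MAGIC) {
                    hits.add(bitsSeen - 48);
                }
            }
        }
        return hits;
    }

    /** Test helper: a zero-filled buffer with the magic written at an arbitrary bit offset. */
    static byte[] withMagicAt(int bitOffset, int totalBytes) {
        byte[] out = new byte[totalBytes];
        for (int i = 0; i < 48; i++) {
            long bit = (BLOCK_MAGIC >>> (47 - i)) & 1L;
            int pos = bitOffset + i;
            if (bit == 1L) {
                out[pos / 8] |= (byte) (0x80 >>> (pos % 8));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The magic starts 13 bits into the buffer, so it is not byte-aligned.
        System.out.println(findBlockMagic(withMagicAt(13, 16))); // prints [13]
    }
}
```

Checking every bit position is what makes the search slow, which matches the "It's not fast, but it's there" comment: blocks in a bzip2 stream are not padded to byte boundaries, so a byte-wise search would miss most block starts.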