Re: Processing compressed files in Hadoop
Leo,
      It should work with your custom InputFormats as well. Add the LZO codec class to io.compression.codecs and try it.
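
For reference, a minimal sketch of what that property change amounts to. io.compression.codecs holds a comma-separated list of codec class names, so the LZO codec is appended to whatever is already there. The class name com.hadoop.compression.lzo.LzoCodec is an assumption (it is the name used by the hadoop-lzo library, which also ships LzopCodec for lzop-format .lzo files), and the CodecList helper is hypothetical, written with plain Java so it runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

final class CodecList {
    // Assumption: codec class name as shipped by the hadoop-lzo library.
    static final String LZO_CODEC = "com.hadoop.compression.lzo.LzoCodec";

    /** Returns the comma-separated codec list with codecClass appended if it is absent. */
    static String withCodec(String current, String codecClass) {
        List<String> codecs = new ArrayList<>();
        if (current != null) {
            for (String name : current.split(",")) {
                String trimmed = name.trim();
                // Skip empty entries and duplicates while preserving order.
                if (!trimmed.isEmpty() && !codecs.contains(trimmed)) {
                    codecs.add(trimmed);
                }
            }
        }
        if (!codecs.contains(codecClass)) {
            codecs.add(codecClass);
        }
        return String.join(",", codecs);
    }

    public static void main(String[] args) {
        String existing = "org.apache.hadoop.io.compress.DefaultCodec,"
                + "org.apache.hadoop.io.compress.GzipCodec";
        // With Hadoop available, the result would be passed to
        // conf.set("io.compression.codecs", ...) or placed in core-site.xml.
        System.out.println(withCodec(existing, LZO_CODEC));
    }
}
```

Once the codec is registered, input formats that pick a codec by file extension can decompress .lzo input. Whether a custom InputFormat also splits those files depends on it consulting the LZO index, which this sketch does not cover.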

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Leonardo Urbina <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Wed, 8 Feb 2012 13:57:42
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Processing compressed files in Hadoop

Hi Bejoy,

Thanks for your response. I know how to index LZO files; however, I am
curious as to whether I can still use my custom InputFormats to process the
compressed LZO files, or whether I have to implement new ones to handle this
case.
-Leo
On Wed, Feb 8, 2012 at 1:33 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Leo
>       You can index the LZO files as
>
> // Run the LZO indexer on files in HDFS
> LzoIndexer indexer = new LzoIndexer(fs.getConf());
> indexer.index(filePath);
>
> Regards
> Bejoy.K.S
>
> On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
>
> > Leo, splittable bzip2 is available
> >  ...in versions > 0.21 -
> > https://issues.apache.org/jira/browse/HADOOP-4012
> >  ...or as a patch for 1.0.0, to be included in 1.1.0 -
> > https://issues.apache.org/jira/browse/HADOOP-7823
> >
> > There is a 48-bit signature in the bzip2 block header, and they search
> > for this at all bit alignments.
> >
> > It's not fast, but it's there.
> >
> >    - Tim.
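
To illustrate the search Tim describes, here is a minimal sketch of scanning a buffer for the 48-bit bzip2 block signature (0x314159265359) at every bit offset. It is not Hadoop's actual implementation; the class and method names are hypothetical, and it assumes the bits are read most-significant-bit first, as bzip2 writes them.

```java
final class Bzip2BlockScan {
    // 48-bit bzip2 block-header signature.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /**
     * Returns the bit offset of the first block signature that starts at or
     * after fromBit, or -1 if none is found.
     */
    static long findBlockMagic(byte[] data, long fromBit) {
        long totalBits = (long) data.length * 8;
        long window = 0;   // sliding window over the last 48 bits read
        int bitsSeen = 0;  // bits consumed so far, capped at 48
        for (long bit = Math.max(0, fromBit); bit < totalBits; bit++) {
            int b = data[(int) (bit >>> 3)] & 0xFF;
            int value = (b >>> (7 - (int) (bit & 7))) & 1;  // MSB first
            window = ((window << 1) | value) & MASK_48;
            if (bitsSeen < 48) {
                bitsSeen++;
            }
            if (bitsSeen == 48 && window == BLOCK_MAGIC) {
                return bit - 47;  // offset of the signature's first bit
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Signature placed 3 bits into an 8-byte big-endian buffer.
        byte[] data = java.nio.ByteBuffer.allocate(8)
                .putLong(BLOCK_MAGIC << 13).array();
        System.out.println(findBlockMagic(data, 0));
    }
}
```

Because every bit position has to be checked rather than every byte, finding a split boundary this way is slower than seeking to a byte-aligned marker, which matches the "not fast" caveat above.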
> >
> > ________________________________________
> > From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of
> > Leonardo Urbina [[EMAIL PROTECTED]]
> > Sent: Wednesday, February 08, 2012 9:39 AM
> > To: [EMAIL PROTECTED]
> > Subject: Processing compressed files in Hadoop
> >
> > Hello everyone,
> >
> > I run a daily job that takes files in a variety of formats and processes
> > them using several custom InputFormats, which are specified using
> > MultipleInputs. The results get aggregated into a single SequenceFile,
> > which is then used as part of the input for the next day's job. I run all
> > of this on Amazon EMR. I would now like to use compression to save on
> > storage, but after looking around online I have hit some dead ends:
> >
> > 1) I would like to compress my input files, and Hadoop gives me three
> > choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> > they cannot be made splittable. LZO, on the other hand, can be indexed;
> > however, as far as I can tell, I would be forced to use LzoTextInputFormat
> > to get Hadoop to properly decompress and read the files. Most of my input
> > cannot use TextInputFormat (my inputs include multi-line records and XML
> > files, among other things). My question is: is it possible to use LZO
> > with custom InputFormats?
> >
> > 2) I am also interested in compressing the output SequenceFile. I know
> > this can be done by setting
> >
> > FileOutputFormat.setCompressOutput(conf, true)
> >
> > If I were using TextOutputFormat, the output would be a gzipped text file.
> > However, being a SequenceFile, it seems to be compressed internally, and
> > the compression scheme is not immediately apparent to me. Is it possible
> > to specify LZO as the compression? Also, since I will be using the output
> > as part of the next input, do I need to index the output as a separate
> > task? And finally, when I specify the input format for the next day (and
> > this goes back to my first question), what InputFormat should I specify? I
> > haven't been able to find something like LzoSequenceInputFormat or
> > anything of the like.
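
On the SequenceFile question, a minimal sketch of the settings involved, assuming Hadoop 1.x property names and the hadoop-lzo codec class com.hadoop.compression.lzo.LzoCodec (both are assumptions about the cluster in use, and SequenceFileLzoOutput is a hypothetical helper). It only builds the properties in plain Java; the comments name the Hadoop setter that normally writes each one.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SequenceFileLzoOutput {
    /** Hadoop 1.x job properties for a block-compressed SequenceFile. */
    static Map<String, String> compressionProperties(String codecClass) {
        Map<String, String> props = new LinkedHashMap<>();
        // Written by FileOutputFormat.setCompressOutput(conf, true).
        props.put("mapred.output.compress", "true");
        // Written by SequenceFileOutputFormat.setOutputCompressionType(
        //     conf, SequenceFile.CompressionType.BLOCK).
        props.put("mapred.output.compression.type", "BLOCK");
        // Written by FileOutputFormat.setOutputCompressorClass(conf, codec).
        props.put("mapred.output.compression.codec", codecClass);
        return props;
    }

    /** Renders the properties as -D options for jobs run through ToolRunner. */
    static String asGenericOptions(Map<String, String> props) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : props.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append("-D ").append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: codec class name from the hadoop-lzo library.
        String codec = "com.hadoop.compression.lzo.LzoCodec";
        System.out.println(asGenericOptions(compressionProperties(codec)));
    }
}
```

A SequenceFile records its codec in the file header and stays splittable at its sync markers, so a file written this way should be readable with the ordinary SequenceFileInputFormat and should not need a separate LZO index.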
> >
> > Am I missing something? Any help would be greatly appreciated. Best,
> > -Leo
> >
> > --
> > Leo Urbina
> > Massachusetts Institute of Technology
> > Department of Electrical Engineering and Computer Science
> > Department of Mathematics
> > [EMAIL PROTECTED]
> >