Re: Processing compressed files in Hadoop
Leo
      It should work with your custom InputFormats as well. Add the LZO codec class to io.compression.codecs and try it.
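
A minimal sketch of that suggestion, assuming the hadoop-lzo classes com.hadoop.compression.lzo.LzoCodec and LzopCodec are on the classpath. To keep it runnable without a cluster, java.util.Properties stands in for Hadoop's Configuration; the class name CodecListSketch and the helper withCodecs are hypothetical, and the starting codec list is only illustrative.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;

public class CodecListSketch {
    // Property Hadoop reads to learn which compression codecs are available.
    static final String CODECS_KEY = "io.compression.codecs";

    // Hypothetical helper: append codec class names to the comma-separated
    // list, keeping existing entries in order and skipping duplicates.
    static String withCodecs(String existing, String... codecClasses) {
        Set<String> codecs = new LinkedHashSet<>();
        if (existing != null) {
            for (String name : existing.split(",")) {
                if (!name.trim().isEmpty()) {
                    codecs.add(name.trim());
                }
            }
        }
        for (String name : codecClasses) {
            codecs.add(name);
        }
        return String.join(",", codecs);
    }

    public static void main(String[] args) {
        // Stand-in for org.apache.hadoop.conf.Configuration (assumption).
        Properties conf = new Properties();
        conf.setProperty(CODECS_KEY,
                "org.apache.hadoop.io.compress.DefaultCodec,"
              + "org.apache.hadoop.io.compress.GzipCodec");

        // Register the LZO codecs alongside whatever is already configured.
        conf.setProperty(CODECS_KEY, withCodecs(conf.getProperty(CODECS_KEY),
                "com.hadoop.compression.lzo.LzoCodec",
                "com.hadoop.compression.lzo.LzopCodec"));

        System.out.println(CODECS_KEY + "=" + conf.getProperty(CODECS_KEY));
    }
}
```

On a real cluster the same value would normally go into core-site.xml or the job's Configuration. A custom RecordReader would presumably still have to open its input through the codec (for example by looking it up with CompressionCodecFactory) rather than reading the raw bytes.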

Regards
Bejoy K S

From handheld, Please excuse typos.

-----Original Message-----
From: Leonardo Urbina <[EMAIL PROTECTED]>
Sender: [EMAIL PROTECTED]
Date: Wed, 8 Feb 2012 13:57:42
To: <[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Processing compressed files in Hadoop

Hi Bejoy,

Thanks for your response. I know how to index LZO files; however, I am
curious about whether I can still use my custom InputFormats to process the
compressed LZO files or if I have to implement new ones to handle this
case.
-Leo
On Wed, Feb 8, 2012 at 1:33 PM, Bejoy Ks <[EMAIL PROTECTED]> wrote:

> Hi Leo
>       You can index the LZO files as
>
> // Run the LZO indexer on files in HDFS
> // (LzoIndexer is com.hadoop.compression.lzo.LzoIndexer from hadoop-lzo)
> LzoIndexer indexer = new LzoIndexer(fs.getConf());
> indexer.index(filePath);
>
> Regards
> Bejoy.K.S
>
> On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:
>
> > Leo, splittable bzip2 is available
> >  ...in versions 0.21 and later -
> > https://issues.apache.org/jira/browse/HADOOP-4012
> >  ...or as a patch for 1.0.0, to be included in 1.1.0 -
> > https://issues.apache.org/jira/browse/HADOOP-7823
> >
> > There is a 48-bit signature in the bzip2 block header, and the splitting
> > code searches for it at every bit alignment.
> >
> > It's not fast, but it's there.
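
The bit-level search described above can be sketched roughly as follows. This is not Hadoop's actual implementation, only an illustration under two assumptions: the signature is the bzip2 block-start magic 0x314159265359, and bits are packed most-significant bit first. The class and method names are hypothetical.

```java
public class BlockMarkerScan {
    // 48-bit bzip2 block-start signature.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    // Returns the bit offset of the first signature that starts at or after
    // fromBit, or -1 if none is found. Every bit is examined, because a
    // block does not have to begin on a byte boundary.
    static long findBlockStart(byte[] data, long fromBit) {
        long window = 0;
        long totalBits = (long) data.length * 8;
        for (long bit = fromBit; bit < totalBits; bit++) {
            int b = data[(int) (bit >>> 3)] & 0xFF;
            int bitValue = (b >>> (7 - (int) (bit & 7))) & 1; // MSB first
            window = ((window << 1) | bitValue) & MASK_48;
            // Only compare once a full 48 bits have been shifted in.
            if (bit - fromBit + 1 >= 48 && window == BLOCK_MAGIC) {
                return bit - 47;
            }
        }
        return -1;
    }

    // Demo helper: write the low 48 bits of value into data, MSB first,
    // starting at bitOffset.
    static void putBits48(byte[] data, long bitOffset, long value) {
        for (int i = 0; i < 48; i++) {
            long bit = bitOffset + i;
            if (((value >>> (47 - i)) & 1L) == 1L) {
                data[(int) (bit >>> 3)] |= (byte) (0x80 >>> (int) (bit & 7));
            }
        }
    }

    public static void main(String[] args) {
        byte[] data = new byte[32];
        putBits48(data, 13, BLOCK_MAGIC); // deliberately not byte-aligned
        System.out.println("signature found at bit " + findBlockStart(data, 0));
    }
}
```

A split reader would start such a scan at its split's offset and begin decompressing at the first block boundary it finds. Shifting in one bit at a time is what makes the approach slow.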
> >
> >    - Tim.
> >
> > ________________________________________
> > From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of
> > Leonardo Urbina [[EMAIL PROTECTED]]
> > Sent: Wednesday, February 08, 2012 9:39 AM
> > To: [EMAIL PROTECTED]
> > Subject: Processing compressed files in Hadoop
> >
> > Hello everyone,
> >
> > I run a daily job that takes files in a variety of different formats and
> > processes them using several custom InputFormats, which are specified using
> > MultipleInputs. The results get aggregated into a single SequenceFile.
> > Furthermore, this SequenceFile is used as part of the input for the next
> > day's job. I run all of this in Amazon's EMR. Now, I would like to be able
> > to use compression in order to save on storage; however, after looking
> > around online I have hit some dead ends:
> >
> > 1) I would like to compress my input files, and Hadoop gives me three
> > choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> > they cannot be made splittable. LZO, on the other hand, can be indexed;
> > however, as far as I could tell, I would be forced to use LzoTextInputFormat
> > in order to get Hadoop to properly decompress and read the files. Most of
> > my input cannot use TextInputFormat (my inputs include multi-line records,
> > XML files, among other things). My question is: is it possible to use LZO
> > with custom InputFormats?
> >
> > 2) I am also interested in compressing the output SequenceFile. I know this
> > can be done by setting
> >
> > FileOutputFormat.setCompressOutput(conf, true)
> >
> > If I were using TextOutputFormat, the output would be a gzipped text file.
> > However, being a SequenceFile, it seems to be internally compressed and the
> > compression scheme is not immediately apparent to me. Is it possible to
> > specify LZO as the compression? Also, since I will be using the output as
> > part of the next input, do I need to index the output as a separate task?
> > And finally, when I specify the input format for the next day (and this
> > goes back to my first question), what InputFormat should I specify? I
> > haven't been able to find something like LzoSequenceInputFormat or anything
> > of the like.
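
One way to picture the settings involved is sketched below, assuming the Hadoop 1.x property names mapred.output.compress, mapred.output.compression.codec and mapred.output.compression.type, and the hadoop-lzo class com.hadoop.compression.lzo.LzoCodec. java.util.Properties again stands in for the job configuration so the snippet runs on its own; the class and method names are hypothetical.

```java
import java.util.Properties;
import java.util.TreeSet;

public class SequenceFileCompressionSketch {
    // Builds the output-compression settings for a job that writes a
    // block-compressed SequenceFile using the LZO codec.
    static Properties lzoBlockCompressedOutput() {
        // Stand-in for a Hadoop JobConf / Configuration (assumption).
        Properties conf = new Properties();
        // Same effect as FileOutputFormat.setCompressOutput(conf, true).
        conf.setProperty("mapred.output.compress", "true");
        // Codec used inside the SequenceFile.
        conf.setProperty("mapred.output.compression.codec",
                "com.hadoop.compression.lzo.LzoCodec");
        // Compress batches of records (BLOCK) rather than each value (RECORD).
        conf.setProperty("mapred.output.compression.type", "BLOCK");
        return conf;
    }

    public static void main(String[] args) {
        Properties conf = lzoBlockCompressedOutput();
        // Print in sorted key order so the output is deterministic.
        for (String key : new TreeSet<>(conf.stringPropertyNames())) {
            System.out.println(key + "=" + conf.getProperty(key));
        }
    }
}
```

In job code the same settings are usually made through FileOutputFormat.setOutputCompressorClass and SequenceFileOutputFormat.setOutputCompressionType. Because a SequenceFile records its codec in the file header and carries its own sync markers, the regular SequenceFileInputFormat should be able to read and split it without a separate LZO index.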
> >
> > Am I missing something? Any help would be greatly appreciated. Best,
> > -Leo
> >
> > --
> > Leo Urbina
> > Massachusetts Institute of Technology
> > Department of Electrical Engineering and Computer Science
> > Department of Mathematics
> > [EMAIL PROTECTED]
> >

Leo Urbina
Massachusetts Institute of Technology
Department of Electrical Engineering and Computer Science
Department of Mathematics
[EMAIL PROTECTED]