Re: Processing compressed files in Hadoop
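Hi Leo,

You can index the LZO files as follows:

// Run the LZO indexer on files in HDFS.
// LzoIndexer comes from the hadoop-lzo library (com.hadoop.compression.lzo);
// fs is an open FileSystem and filePath is the Path of the .lzo file.
LzoIndexer indexer = new LzoIndexer(fs.getConf());
indexer.index(filePath);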

Regards
Bejoy.K.S

On Wed, Feb 8, 2012 at 11:26 PM, Tim Broberg <[EMAIL PROTECTED]> wrote:

> Leo, splittable bzip2 is available
>  ...in versions >= 0.21 - https://issues.apache.org/jira/browse/HADOOP-4012
>  ...or as a patch for 1.0.0, to be included in 1.1.0 -
> https://issues.apache.org/jira/browse/HADOOP-7823
>
> There is a 48-bit signature at the start of each bzip2 block, and the
> reader searches for it at all bit alignments.
>
> It's not fast, but it's there.
>
>    - Tim.
>
> ________________________________________
> From: [EMAIL PROTECTED] [[EMAIL PROTECTED]] On Behalf Of
> Leonardo Urbina [[EMAIL PROTECTED]]
> Sent: Wednesday, February 08, 2012 9:39 AM
> To: [EMAIL PROTECTED]
> Subject: Processing compressed files in Hadoop
>
> Hello everyone,
>
> I run a daily job that takes files in a variety of formats and
> processes them using several custom InputFormats, which are specified using
> MultipleInputs. The results get aggregated into a single SequenceFile.
> Furthermore, this SequenceFile is used as part of the input for the next
> day's job. I run all of this in Amazon's EMR. Now, I would like to
> use compression in order to save on storage; however, after looking
> around online I have hit some dead ends:
>
> 1) I would like to compress my input files, and Hadoop gives me three
> choices: gzip, bzip2 and LZO. I want to steer away from gzip and bzip2 as
> they cannot be made splittable. LZO on the other hand can be indexed,
> however as far as I could tell, I would be forced to use LzoTextInputFormat
> in order to get Hadoop to properly decompress and read the files. Most of
> my input cannot use TextInputFormat (my inputs include multi-line records,
> XML files, among other things). My question is, is it possible to use LZO
> with custom InputFormats?
>
> 2) I am also interested in compressing the output SequenceFile. I know this
> can be done by setting
>
> FileOutputFormat.setCompressOutput(conf, true)
>
> If I were using TextOutputFormat, the output would be a gzipped text file.
> However, since the output is a SequenceFile, it seems to be compressed
> internally, and the compression scheme is not immediately apparent to me.
> Is it possible to specify LZO as the compression? Also, since I will be
> using the output as part of the next input, do I need to index the output
> as a separate task? And finally, when I specify the input format for the
> next day (and this goes back to my first question), what InputFormat
> should I specify? I haven't been able to find something like
> LzoSequenceInputFormat or anything of the sort.
>
> Am I missing something? Any help would be greatly appreciated. Best,
> -Leo
>
> --
> Leo Urbina
> Massachusetts Institute of Technology
> Department of Electrical Engineering and Computer Science
> Department of Mathematics
> [EMAIL PROTECTED]
>
> The information and any attached documents contained in this message
> may be confidential and/or legally privileged.  The message is
> intended solely for the addressee(s).  If you are not the intended
> recipient, you are hereby notified that any use, dissemination, or
> reproduction is strictly prohibited and may be unlawful.  If you are
> not the intended recipient, please contact the sender immediately by
> return e-mail and destroy all copies of the original message.
>
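
Tim's remark about searching for the 48-bit signature at all bit alignments can be illustrated with a small standalone sketch. This is not Hadoop's implementation; the class name Bzip2BlockScanner and the method findBlockStarts are hypothetical names chosen for illustration. It assumes the standard bzip2 block magic 0x314159265359 and MSB-first bit packing, and it scans an in-memory byte array one bit at a time, which also shows why the approach is slow.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code used by Hadoop.
public class Bzip2BlockScanner {
    // 48-bit magic number that starts each bzip2 compressed block.
    static final long BLOCK_MAGIC = 0x314159265359L;
    static final long MASK_48 = 0xFFFFFFFFFFFFL;

    /** Returns the bit offsets (counted from the start of data) where the magic begins. */
    static List<Long> findBlockStarts(byte[] data) {
        List<Long> starts = new ArrayList<>();
        long window = 0;   // the most recent 48 bits seen
        long bitsRead = 0;
        for (byte b : data) {
            // bzip2 packs bits most-significant-bit first.
            for (int bit = 7; bit >= 0; bit--) {
                window = ((window << 1) | ((b >> bit) & 1)) & MASK_48;
                bitsRead++;
                if (bitsRead >= 48 && window == BLOCK_MAGIC) {
                    starts.add(bitsRead - 48);
                }
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        // Build 7 bytes holding 3 zero bits, the 48-bit magic, then 5 zero bits,
        // so the magic is deliberately not byte-aligned.
        long v = BLOCK_MAGIC << 5;
        byte[] data = new byte[7];
        for (int i = 0; i < 7; i++) {
            data[i] = (byte) (v >>> (8 * (6 - i)));
        }
        System.out.println(findBlockStarts(data)); // prints [3]
    }
}
```

Because blocks can begin at any bit position, a reader that starts in the middle of a file has to test every bit offset rather than every byte offset, which is the cost Tim alludes to.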