MapReduce, mail # user - CompressionCodec in MapReduce


RE: CompressionCodec in MapReduce
Devaraj k 2012-04-11, 08:37
Hi Grzegorz,

    You can find below the properties for job input and output compression.

The property below is used by the codec factory. The codec is chosen based on the type (i.e. suffix) of the file. By default the LineRecordReader used by FileInputFormat relies on this. If you want input compression handled some other way, you can write an input format that does so.

core-site.xml:
---------------

<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,org.apache.hadoop.io.compress.Lz4Codec</value>
  <description>A list of the compression codec classes that can be used
               for compression/decompression.</description>
</property>
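To illustrate how the factory uses that list: it maps each codec's default file extension to the codec class and picks a codec by matching the input path's suffix. The sketch below is a pure-Java illustration of that lookup, not Hadoop's real CompressionCodecFactory; the paths are made up.

```java
import java.util.HashMap;
import java.util.Map;

public class CodecLookup {
    // Suffix-to-codec mapping, mirroring what the codec factory builds
    // from io.compression.codecs (illustrative only).
    static final Map<String, String> SUFFIX_TO_CODEC = new HashMap<>();
    static {
        SUFFIX_TO_CODEC.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        SUFFIX_TO_CODEC.put(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        SUFFIX_TO_CODEC.put(".snappy", "org.apache.hadoop.io.compress.SnappyCodec");
        SUFFIX_TO_CODEC.put(".lz4", "org.apache.hadoop.io.compress.Lz4Codec");
        SUFFIX_TO_CODEC.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    // Return the codec class name whose default extension matches the
    // path's suffix, or null if no codec matches (file read uncompressed).
    static String codecFor(String path) {
        for (Map.Entry<String, String> e : SUFFIX_TO_CODEC.entrySet()) {
            if (path.endsWith(e.getKey())) {
                return e.getValue();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(codecFor("/data/input/part-00000.gz"));
        System.out.println(codecFor("/data/input/part-00000.txt"));
    }
}
```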
   I am not sure which version of Hadoop you are using, so I am giving the properties for both newer and older versions. These are the properties you need to configure if you want to compress job outputs; they take effect only when the output format is a FileOutputFormat.

mapred-site.xml:(for version 0.23  and later)
---------------------------------------------------

<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>false</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compression.type</name>
  <value>RECORD</value>
  <description>If the job outputs are to be compressed as SequenceFiles, how should
               they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>

<property>
  <name>mapreduce.output.fileoutputformat.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?
  </description>
</property>
mapred-site.xml:(for older versions)
------------------------------------------

<property>
  <name>mapred.output.compress</name>
  <value>false</value>
  <description>Should the job outputs be compressed?
  </description>
</property>

<property>
  <name>mapred.output.compression.type</name>
  <value>RECORD</value>
  <description>If the job outputs are to be compressed as SequenceFiles, how should
               they be compressed? Should be one of NONE, RECORD or BLOCK.
  </description>
</property>

<property>
  <name>mapred.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
  <description>If the job outputs are compressed, how should they be compressed?
  </description>
</property>
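You can also set the same keys programmatically instead of editing mapred-site.xml. The sketch below uses plain java.util.Properties to show the key/value pairs; in a real job you would set them on Hadoop's Configuration object (or use the FileOutputFormat.setCompressOutput / setOutputCompressorClass helpers), and GzipCodec here is just one possible choice.

```java
import java.util.Properties;

public class OutputCompressionConfig {
    // Programmatic equivalent of the mapred-site.xml snippet above,
    // sketched with java.util.Properties in place of Hadoop's
    // Configuration class (same keys, same values).
    static Properties compressedOutputProps() {
        Properties conf = new Properties();
        conf.setProperty("mapreduce.output.fileoutputformat.compress", "true");
        conf.setProperty("mapreduce.output.fileoutputformat.compression.type", "BLOCK");
        conf.setProperty("mapreduce.output.fileoutputformat.compression.codec",
                "org.apache.hadoop.io.compress.GzipCodec");
        return conf;
    }

    public static void main(String[] args) {
        compressedOutputProps().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```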
If you want to use compression with your custom input and output formats, you can implement the compression in those classes.
Thanks
Devaraj
________________________________________
From: Grzegorz Gunia [[EMAIL PROTECTED]]
Sent: Wednesday, April 11, 2012 1:46 PM
To: [EMAIL PROTECTED]
Subject: Re: CompressionCodec in MapReduce

Thanks for your reply! That clears some things up.
There is but one problem: my CompressionCodec has to be instantiated on a per-file basis, meaning it needs to know the name of the file it is to compress/decompress. I'm guessing that would not be possible with the current implementation?

Or if so, how would I proceed with injecting it with the file name?
--
Greg

On 2012-04-11 10:12, Zizon Qiu wrote:
Append your custom codec's full class name to "io.compression.codecs", either in mapred-site.xml or in the Configuration object passed to the Job constructor.

The MapReduce framework will try to guess the compression algorithm from the input file's suffix.

If any CompressionCodec.getDefaultExtension() registered in the configuration matches the suffix, Hadoop will try to instantiate the codec and, if that succeeds, decompress for you automatically.

The default value of "io.compression.codecs" is "org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec".
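So the only hook a custom codec needs for suffix matching is getDefaultExtension(). The sketch below shows that hook against a minimal local stand-in for the relevant slice of the CompressionCodec interface (a real codec also implements createInputStream/createOutputStream and lives on the job's classpath); the class name MyCustomCodec and the ".mycc" suffix are made up for illustration.

```java
// Minimal stand-in for the slice of Hadoop's CompressionCodec
// interface involved in suffix matching (illustrative only).
interface CompressionCodecSketch {
    String getDefaultExtension();
}

public class MyCustomCodec implements CompressionCodecSketch {
    @Override
    public String getDefaultExtension() {
        // Hypothetical suffix: input files ending in .mycc would be
        // routed to this codec by the factory's suffix match.
        return ".mycc";
    }

    public static void main(String[] args) {
        CompressionCodecSketch codec = new MyCustomCodec();
        String file = "input/data.mycc";
        // This endsWith check is the essence of what the factory does.
        System.out.println(file.endsWith(codec.getDefaultExtension()));
    }
}
```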

On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:
Hello,
I am trying to make a custom CompressionCodec work with MapReduce jobs, but I haven't found a way to inject it during the reading of input data or during the writing of the job results.
Am I missing something, or is there no support for compressed files in the filesystem?

I am well aware of how to set it up to work during the intermediate phases of the MapReduce operation, but I just can't find a way to apply it BEFORE the job takes place...
Is there any other way except simply uncompressing the files I need prior to scheduling a job?

Huge thanks for any help you can give me!
Greg