Re: CompressionCodec in MapReduce
Zizon Qiu 2012-04-11, 08:44
If:
1. you are using TextInputFormat,
2. all input files end with a known suffix such as ".gz", and
3. your custom CompressionCodec is already registered in the configuration and
its getDefaultExtension() returns the same suffix as described in 2,
then there is nothing else you need to do.
Hadoop will deal with it automatically.
That means the input key and value passed to the map method are already decompressed.
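
To illustrate the suffix matching described above, here is a minimal, Hadoop-free sketch. It is not Hadoop's actual code; the class SuffixCodecLookup, its methods, the ".myc" extension and com.example.MyCodec are hypothetical names made up for the example. It only models the idea that each registered codec contributes its default extension and the file name's suffix selects the codec.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hadoop-free stand-in for suffix-based codec selection (hypothetical class). */
final class SuffixCodecLookup {
    // Maps a default extension (e.g. ".gz") to the name of the codec class that handles it.
    private final Map<String, String> codecByExtension = new LinkedHashMap<>();

    /** Registers a codec under the extension its getDefaultExtension() would return. */
    void register(String defaultExtension, String codecClassName) {
        codecByExtension.put(defaultExtension, codecClassName);
    }

    /** Returns the codec whose extension ends the file name, or null if none matches. */
    String codecFor(String fileName) {
        String best = null;
        int bestLength = -1;
        for (Map.Entry<String, String> entry : codecByExtension.entrySet()) {
            String extension = entry.getKey();
            // Prefer the longest matching extension so ".tar.gz" would win over ".gz".
            if (fileName.endsWith(extension) && extension.length() > bestLength) {
                best = entry.getValue();
                bestLength = extension.length();
            }
        }
        return best;
    }
}
```

With ".gz" and ".myc" registered, codecFor("part-00000.myc") selects the custom codec, while a name such as "plain.txt" matches nothing and would be read uncompressed.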
But if the original files do not end with a known suffix, you need to write
your own InputFormat, or subclass TextInputFormat and override the
createRecordReader method so that it returns your own RecordReader.
The InputSplit handed to the InputFormat is actually a FileSplit, from which you
can retrieve the input file path.
You may also want to look at the isSplitable method declared
in FileInputFormat, if your files are not splittable.
For more detail, refer to the TextInputFormat class implementation.
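
The core step of such a custom RecordReader is to open the raw stream with a decoder chosen from the file path obtained from the split. Below is a rough, Hadoop-free sketch of just that step, with java.util.zip standing in for the custom codec. The class PerFileReaderSketch, its methods and the "packed-" naming rule are all hypothetical; a real reader would implement Hadoop's RecordReader and open the file through the FileSystem API instead.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Sketch of choosing a decoder per file when names carry no codec suffix (hypothetical). */
final class PerFileReaderSketch {

    /** Wraps the raw stream in a decoder picked from the file name alone. */
    static InputStream openFor(String fileName, InputStream raw) throws IOException {
        // Assumption for the example: files named "packed-*" are gzip data.
        // A real per-file codec would derive its own parameters from fileName here.
        if (fileName.startsWith("packed-")) {
            return new GZIPInputStream(raw);
        }
        return raw;
    }

    /** Reads all decoded lines, which is what the reader would hand to map() as values. */
    static List<String> readLines(String fileName, InputStream raw) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(openFor(fileName, raw), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    /** Helper that produces gzip bytes so the sketch can be tried without any files. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

Because a gzip stream cannot be decoded starting from the middle, a format like this is also the case where isSplitable should return false, so that each file is read by a single reader from its first byte.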
On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia wrote:
> Thanks for your reply! That clears some things up.
> There is but one problem... My CompressionCodec has to be instantiated on
> a per-file basis, meaning it needs to know the name of the file it is to
> compress/decompress. I'm guessing that would not be possible with the
> current implementation?
> Or if so, how would I proceed with injecting it with the file name?
> On 2012-04-11 10:12, Zizon Qiu wrote:
> Append your custom codec's fully qualified class name to "io.compression.codecs", either
> in mapred-site.xml or in the Configuration object passed to the Job constructor.
> The MapReduce framework will try to guess the compression algorithm from
> the input file's suffix.
> If the getDefaultExtension() of any CompressionCodec registered in the
> configuration matches the suffix, Hadoop will try to instantiate that codec and,
> if that succeeds, decompress the input for you automatically.
> the default value for "io.compression.codecs" is
> On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <
> [EMAIL PROTECTED]> wrote:
>> I am trying to apply a custom CompressionCodec to work with MapReduce
>> jobs, but I haven't found a way to inject it during the reading of input
>> data, or during the write of the job results.
>> Am I missing something, or is there no support for compressed files in
>> the filesystem?
>> I am well aware of how to set it up to work during the intermediate phases
>> of the MapReduce operation, but I just can't find a way to apply it BEFORE
>> the job takes place...
>> Is there any other way except simply uncompressing the files I need prior
>> to scheduling a job?
>> Huge thanks for any help you can give me!