Re: CompressionCodec in MapReduce
Zizon Qiu 2012-04-11, 14:05
It is possible but a little tricky.
As I mentioned before, write a custom InputFormat and the associate
On Wed, Apr 11, 2012 at 5:23 PM, Grzegorz Gunia
> I think we misunderstood each other here.
> I'll base my question upon an example:
> Let's say I want each of the files stored on my HDFS to be encrypted prior
> to being physically stored on the cluster.
> For that I'll write a custom CompressionCodec, that performs the
> encryption, and use it during any edits/creations of files in the HDFS.
> Then to make it more secure I'll make it so it uses different keys for
> different files, and supply the keys to the codec during its instantiation.
> Now I'd like to do a MapReduce job on those files. That would require
> instantiating the codec, and supplying it with the filename, to determine
> the key used. Is it possible to do so with the current implementation of
> On 2012-04-11 10:44, Zizon Qiu wrote:
> If:
> 1. you are using TextInputFormat,
> 2. all input files end with a certain suffix such as ".gz", and
> 3. the custom CompressionCodec is already registered in the configuration and
> its getDefaultExtension() returns the same suffix as described in 2,
> then there is nothing else you need to do.
> Hadoop will deal with it automatically.
> That means the input key and value in the map method are already decompressed.
> But if the original files do not end with a certain suffix, you need
> to write your own InputFormat or subclass TextInputFormat and override the
> createRecordReader method so that it returns your own RecordReader.
> The InputSplit passed to createRecordReader is actually a FileSplit, from
> which you can retrieve the input file path.
> You may also take a look at the isSplitable method declared
> in FileInputFormat, if your files are not splittable.
> For more detail, refer to the TextInputFormat class implementation.
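
The per-file step that such a custom RecordReader would need can be sketched in plain Java. This is only an illustration under stated assumptions, not Hadoop API and not the poster's code: the class `PerFileDecryptor`, its `open` method, the in-memory key map, and the choice of AES/CBC with a caller-supplied IV are all hypothetical. It shows the one idea from the thread: take the file path, pick the key that belongs to that file, and wrap the raw stream so the reader sees plaintext.

```java
import java.io.InputStream;
import java.security.GeneralSecurityException;
import java.util.Map;
import javax.crypto.Cipher;
import javax.crypto.CipherInputStream;
import javax.crypto.spec.IvParameterSpec;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical helper: selects a key by file name and decrypts a stream. */
final class PerFileDecryptor {
    // Assumption: keys are held in memory, indexed by bare file name.
    private final Map<String, byte[]> keysByFileName;

    PerFileDecryptor(Map<String, byte[]> keysByFileName) {
        this.keysByFileName = keysByFileName;
    }

    /**
     * Wraps the raw (encrypted) stream of the given file in a decrypting
     * stream, using the key registered for that file's name.
     */
    InputStream open(String filePath, InputStream raw, byte[] iv)
            throws GeneralSecurityException {
        // Strip the directory part so "/data/a.enc" looks up "a.enc".
        String name = filePath.substring(filePath.lastIndexOf('/') + 1);
        byte[] key = keysByFileName.get(name);
        if (key == null) {
            throw new IllegalArgumentException("No key registered for " + name);
        }
        // Assumption: files were encrypted with AES/CBC/PKCS5Padding.
        Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
        cipher.init(Cipher.DECRYPT_MODE,
                new SecretKeySpec(key, "AES"), new IvParameterSpec(iv));
        return new CipherInputStream(raw, cipher);
    }
}
```

In a real job the path string would come from the FileSplit handed to createRecordReader, and the raw stream from opening that path on the FileSystem. A block cipher stream like this cannot be read from an arbitrary offset, so such files would also need isSplitable to return false.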
> On Wed, Apr 11, 2012 at 4:16 PM, Grzegorz Gunia <
> [EMAIL PROTECTED]> wrote:
>> Thanks for your reply! That clears some things up.
>> There is but one problem... My CompressionCodec has to be instantiated on
>> a per-file basis, meaning it needs to know the name of the file it is to
>> compress/decompress. I'm guessing that would not be possible with the
>> current implementation?
>> Or if so, how would I proceed with injecting it with the file name?
>> On 2012-04-11 10:12, Zizon Qiu wrote:
>> Append your custom codec's full class name to "io.compression.codecs",
>> either in mapred-site.xml or in the Configuration object passed to the Job.
>> The MapReduce framework will try to guess the compression algorithm from
>> the input file's suffix.
>> If the getDefaultExtension() of any CompressionCodec registered in the
>> configuration matches the suffix, Hadoop will try to instantiate the codec
>> and, if that succeeds, decompress the input for you automatically.
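
The suffix matching described above can be modelled in a few lines of plain Java. This is a toy illustration, not Hadoop's own code: the class `SuffixCodecLookup` and its `register` and `find` methods are hypothetical names, and the real lookup is performed by Hadoop's CompressionCodecFactory against the codecs listed in "io.compression.codecs".

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Toy model of suffix-based codec selection; not part of Hadoop. */
final class SuffixCodecLookup {
    // Maps a default extension such as ".gz" to a codec class name,
    // keeping registration order.
    private final Map<String, String> codecsByExtension = new LinkedHashMap<>();

    /** Registers a codec class name under the extension it claims. */
    void register(String defaultExtension, String codecClassName) {
        codecsByExtension.put(defaultExtension, codecClassName);
    }

    /** Returns the first registered codec whose extension ends the file name. */
    Optional<String> find(String fileName) {
        for (Map.Entry<String, String> entry : codecsByExtension.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        // No extension matched: the file would be read as-is.
        return Optional.empty();
    }
}
```

The model also shows why a per-file codec does not fit this mechanism directly: the lookup only selects a codec class from the suffix, and nothing in it hands the file name to the codec instance.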
>> the default value for "io.compression.codecs" is
>> On Wed, Apr 11, 2012 at 3:55 PM, Grzegorz Gunia <
>> [EMAIL PROTECTED]> wrote:
>>> I am trying to apply a custom CompressionCodec to work with MapReduce
>>> jobs, but I haven't found a way to inject it during the reading of input
>>> data, or during the write of the job results.
>>> Am I missing something, or is there no support for compressed files in
>>> the filesystem?
>>> I am well aware of how to set it up to work during the intermediate
>>> phases of the MapReduce operation, but I just can't find a way to apply it
>>> BEFORE the job takes place...
>>> Is there any other way except simply uncompressing the files I need
>>> prior to scheduling a job?
>>> Huge thanks for any help you can give me!