Re: zlib does not uncompress gzip during MR run
Just to follow up on my own question...

I believe the problem is caused by the input split during MR. So my real
question is how to handle input splits when the input is gzipped.

Is it even possible to have splits of a gzipped file?

Thanks,

Jon
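
[To the question above: a raw gzip stream cannot be picked up mid-file, because the DEFLATE data has no record boundaries and the decompressor expects the gzip header at offset 0 of whatever stream it is handed. A minimal, self-contained sketch using java.util.zip rather than Hadoop's codec classes — class and variable names here are illustrative — reproduces the same kind of header failure an input split causes:]

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.ZipException;

public class GzipSplitDemo {
    public static void main(String[] args) throws IOException {
        // Gzip some sample lines, as GzipCodec would on the way in.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write("line one\nline two\nline three\n".getBytes("UTF-8"));
        }
        byte[] compressed = buf.toByteArray();

        // Reading from offset 0 works: the gzip header is exactly
        // where the decompressor expects it.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed))))) {
            System.out.println("from offset 0: " + r.readLine());
        }

        // Simulate an input split that starts mid-file: the stream no
        // longer begins with a gzip header, so decompression fails with
        // a header error, much like the IOException in the MR log below.
        byte[] tail = Arrays.copyOfRange(compressed,
                compressed.length / 2, compressed.length);
        try {
            new GZIPInputStream(new ByteArrayInputStream(tail)).read();
        } catch (ZipException e) {
            System.out.println("from mid-file split: " + e.getMessage());
        }
    }
}
```

[In stock Hadoop, TextInputFormat declines to split a file whose codec is not splittable, so each .gz file goes to exactly one mapper; seeing this header error inside a job usually suggests a custom InputFormat whose isSplitable() still returns true for compressed paths. Bzip2 is the codec that splits out of the box.]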

On Tue, Oct 23, 2012 at 11:10 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:

> Hi,
>
> My input files are gzipped, and I am using the builtin java codecs
> successfully to uncompress them in a normal java run...
>
>         fileIn = fs.open(fsplit.getPath());
>         codec = compressionCodecs.getCodec(fsplit.getPath());
>         in = new LineReader(codec != null ? codec.createInputStream(fileIn) : fileIn, config);
>
> But when I use the same piece of code in a MR job I am getting...
>
> 12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
> 12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
> 12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
> 12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
> 12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process : 3
> 12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
> 12/10/23 11:02:28 INFO mapred.JobClient:  map 0% reduce 0%
> 12/10/23 11:02:49 INFO mapred.JobClient: Task Id : attempt_201210221549_0014_m_000003_0, Status : FAILED
> java.io.IOException: incorrect header check
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
>     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
>     at java.io.InputStream.read(InputStream.java:101)
>
> So I am thinking that there is some incompatibility between zlib and my gzip files.
> Is there a way to force Hadoop to use the built-in Java compression codecs?
>
> Also, I would like to try lzo which I hope will allow splitting of the
> input files (I recall reading this somewhere). Can someone point me to the
> best way to do this?
>
> Thanks,
>
> Jon
>
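
[Regarding the two follow-up questions in the quoted message: both codec selection and native-library use are driven by configuration. A hedged sketch for a 1.x-era core-site.xml — property names are from that release line, so verify them against your version's core-default.xml:]

```xml
<!-- core-site.xml -->
<property>
  <name>hadoop.native.lib</name>
  <!-- false skips the native zlib library and falls back to the
       built-in Java codec implementations -->
  <value>false</value>
</property>
<property>
  <name>io.compression.codecs</name>
  <!-- codecs are matched to input files by extension (.gz, .bz2, ...) -->
  <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
```

[Note that falling back to the pure-Java codec would not by itself fix the error above, since any gzip decompressor needs the stream to start at the file's beginning. For splittable input, bzip2 works out of the box, and LZO can be made splittable via the third-party hadoop-lzo package after indexing each file.]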