zlib does not uncompress gzip during MR run
Hi,

My input files are gzipped, and I can uncompress them successfully with the
built-in Java codecs in a standalone Java run...

        // open the split's file and look up a codec by file extension
        fileIn = fs.open(fsplit.getPath());
        codec = compressionCodecs.getCodec(fsplit.getPath());
        // wrap the raw stream in the codec's decompressor when one matched
        in = new LineReader(codec != null ? codec.createInputStream(fileIn) : fileIn, config);
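
For reference, the setup around that snippet looks roughly like this
(compressionCodecs is a CompressionCodecFactory, which is where
getCodec(Path) comes from, and fsplit is the mapper's FileSplit):

        // rough reconstruction of the setup around the snippet above
        Configuration config = new Configuration();
        FileSystem fs = FileSystem.get(config);
        // maps a file extension such as .gz onto the matching codec
        CompressionCodecFactory compressionCodecs = new CompressionCodecFactory(config);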

But when I use the same piece of code in an MR job, I get the following:

12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process : 3
12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
12/10/23 11:02:28 INFO mapred.JobClient:  map 0% reduce 0%
12/10/23 11:02:49 INFO mapred.JobClient: Task Id : attempt_201210221549_0014_m_000003_0, Status : FAILED
java.io.IOException: incorrect header check
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
    at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
    at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
    at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
    at java.io.InputStream.read(InputStream.java:101)

So I am thinking that there is some incompatibility between zlib and my
gzip files. Is there a way to force Hadoop to use the built-in Java
compression codecs?
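
What I have in mind is a configuration switch along these lines, though I
have not verified the property name (this is the Hadoop 1.x key; newer
releases seem to rename it to io.native.lib.available):

        // untested sketch: ask Hadoop 1.x to skip the native zlib and
        // fall back to the pure-Java codec implementations
        Configuration conf = new Configuration();
        conf.setBoolean("hadoop.native.lib", false);
        Job job = new Job(conf, "gzip input job");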

Also, I would like to try LZO, which I hope will allow the input files to
be split (I recall reading this somewhere). Can someone point me to the
best way to do this?
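
From what I have read so far, splittable LZO needs the third-party
hadoop-lzo package plus an index file built for each .lzo input, roughly
like the sketch below (class names taken on faith from the hadoop-lzo
project, untested on my side):

        // untested sketch, assuming hadoop-lzo is on the classpath; each
        // .lzo file first needs a .index built with
        // com.hadoop.compression.lzo.LzoIndexer before it can be split
        Configuration conf = new Configuration();
        Job job = new Job(conf, "lzo input job");
        job.setInputFormatClass(com.hadoop.mapreduce.LzoTextInputFormat.class);

Is that the right direction?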

Thanks,

Jon