Re: zlib does not uncompress gzip during MR run
Harsh J 2012-10-24, 04:17
Hi,
Do your files carry the .gz extension? They shouldn't have been split if so. However, if I look at your code, you do not seem to be using the fileSplit object's offset and length attributes anywhere. So, this doesn't look like a problem of splits to me.

You're loading up the wrong codec unintentionally. Only .gz filename suffixes map to GzipCodec if you use "compressionCodecs.getCodec(fsplit.getPath());", otherwise, instantiate the GzipCodec directly and use that instead of the getCodec helper call.

On Wed, Oct 24, 2012 at 12:11 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
> Just to follow up on my own question...
>
> I believe the problem is caused by the input split during MR. So my real
> question is how to handle input splits when the input is gzipped.
>
> Is it even possible to have splits of a gzipped file?
>
> Thanks,
>
> Jon
>
>
> On Tue, Oct 23, 2012 at 11:10 AM, Jonathan Bishop <[EMAIL PROTECTED]>
> wrote:
>>
>> Hi,
>>
>> My input files are gzipped, and I am using the builtin java codecs
>> successfully to uncompress them in a normal java run...
>>
>> fileIn = fs.open(fsplit.getPath());
>> codec = compressionCodecs.getCodec(fsplit.getPath());
>> in = new LineReader(codec != null ? codec.createInputStream(fileIn) : fileIn, config);
>>
>> But when I use the same piece of code in a MR job I am getting...
>>
>> 12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>> 12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
>> 12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
>> 12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process : 3
>> 12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
>> 12/10/23 11:02:28 INFO mapred.JobClient: map 0% reduce 0%
>> 12/10/23 11:02:49 INFO mapred.JobClient: Task Id : attempt_201210221549_0014_m_000003_0, Status : FAILED
>> java.io.IOException: incorrect header check
>> at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
>> at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
>> at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
>> at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
>> at java.io.InputStream.read(InputStream.java:101)
>>
>> So I am thinking that there is some incompatibility of zlib and my gzip.
>> Is there a way to force hadoop to use the java built-in compression codecs?
>>
>> Also, I would like to try lzo which I hope will allow splitting of the
>> input files (I recall reading this somewhere). Can someone point me to the
>> best way to do this?
>>
>> Thanks,
>>
>> Jon
>
>

--
Harsh J
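
The "incorrect header check" in the trace above is the error zlib reports when the data does not start with a zlib header, which fits the wrong-codec diagnosis: a gzip file begins with the magic bytes 0x1f 0x8b, not a zlib header. The sketch below reproduces that mismatch with the JDK's own java.util.zip classes only, so it is an illustration of the symptom and not Hadoop code. The class and method names (GzipHeaderCheck, looksLikeGzip, zlibInflaterAccepts, gunzip) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;

// Hypothetical demo class; not part of Hadoop or of the code in this thread.
class GzipHeaderCheck {

    // True if the buffer starts with the gzip magic bytes 0x1f 0x8b.
    static boolean looksLikeGzip(byte[] data) {
        return data.length >= 2
                && (data[0] & 0xff) == 0x1f
                && (data[1] & 0xff) == 0x8b;
    }

    // Compress a string into gzip format, standing in for a .gz input file.
    static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return bytes.toByteArray();
    }

    // Feed the data to a plain zlib-format Inflater (no gzip handling).
    // Returns false if the inflater rejects the stream.
    static boolean zlibInflaterAccepts(byte[] data) {
        Inflater inflater = new Inflater(); // expects a zlib header
        try {
            inflater.setInput(data);
            inflater.inflate(new byte[1024]);
            return true;
        } catch (DataFormatException e) {
            return false;
        } finally {
            inflater.end();
        }
    }

    // Decompress with a gzip-aware stream, the JDK analogue of picking
    // the gzip codec explicitly.
    static String gunzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(data))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("hello\n");
        System.out.println("gzip magic present:    " + looksLikeGzip(data));
        System.out.println("zlib inflater accepts: " + zlibInflaterAccepts(data));
        System.out.print("gunzip output:         " + gunzip(data));
    }
}
```

Checking the first two bytes of an input file in this way is a quick test of whether it really is gzip data, independent of its filename suffix.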