-
zlib does not uncompress gzip during MR run
Jonathan Bishop 2012-10-23, 18:10
Hi,
My input files are gzipped, and I am using the built-in Java codecs successfully to uncompress them in a normal Java run:
    fileIn = fs.open(fsplit.getPath());
    codec = compressionCodecs.getCodec(fsplit.getPath());
    in = new LineReader(codec != null ? codec.createInputStream(fileIn) : fileIn, config);
But when I use the same piece of code in an MR job, I get:
    12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
    12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
    12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process : 3
    12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
    12/10/23 11:02:28 INFO mapred.JobClient: map 0% reduce 0%
    12/10/23 11:02:49 INFO mapred.JobClient: Task Id : attempt_201210221549_0014_m_000003_0, Status : FAILED
    java.io.IOException: incorrect header check
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
        at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
        at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
        at java.io.InputStream.read(InputStream.java:101)
So I am thinking that there is some incompatibility between zlib and my gzip files. Is there a way to force Hadoop to use the built-in Java compression codecs?
Also, I would like to try LZO, which I hope will allow splitting of the input files (I recall reading this somewhere). Can someone point me to the best way to do this?
Thanks,
Jon
-
Re: zlib does not uncompress gzip during MR run
Jonathan Bishop 2012-10-23, 18:41
Just to follow up on my own question...
I believe the problem is caused by the input split during MR. So my real question is how to handle input splits when the input is gzipped.
Is it even possible to have splits of a gzipped file?
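To illustrate the split question, here is a minimal JDK-only sketch (no Hadoop involved). It rests on the assumption that a plain .gz file is normally one compressed stream whose header sits at byte 0, so a reader that starts at any later offset has no header to parse. The class and method names (GzipSplitDemo, gzip, readFrom) are hypothetical and only for illustration; the error text it prints comes from java.util.zip and will not necessarily match the "incorrect header check" message from native zlib.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; none of these names are part of Hadoop.
class GzipSplitDemo {

    // Compress the text into a single gzip stream held in memory.
    static byte[] gzip(String text) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Try to decompress starting at byte `offset`, roughly what a task
    // handed a split beginning at that offset would have to do.
    // Returns the decoded text, or null if the bytes cannot be decoded.
    static String readFrom(byte[] gz, int offset) {
        byte[] tail = Arrays.copyOfRange(gz, offset, gz.length);
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(tail))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                out.write(chunk, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            System.err.println("cannot decode from offset " + offset + ": " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("line ").append(i).append('\n');
        }
        byte[] gz = gzip(sb.toString());

        // Reading from the start of the file decodes the whole input.
        System.out.println("offset 0 readable: " + (readFrom(gz, 0) != null));
        // Reading from the middle is expected to fail: no header there.
        int mid = gz.length / 2;
        System.out.println("offset " + mid + " readable: " + (readFrom(gz, mid) != null));
    }
}
```

With Java 11 or later this can be run directly as `java GzipSplitDemo.java`. If the sketch reflects what happens in the job, only the task whose split starts at byte 0 can decode its data, which suggests treating each gzipped file as a single unsplit input.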
Thanks,
Jon