Just to follow up on my own question...
I now believe the problem is caused by the way the input is split in the
MR job. So my real question is how to handle input splits when the input
files are gzipped. Is it even possible to split a gzipped file?
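
As a sanity check on that theory, here is a minimal sketch that uses only
java.util.zip rather than the Hadoop codecs. The names GzipSplitDemo,
gzipSample and canReadFrom are made up for illustration. It compresses some
lines into a single gzip stream, then tries to start decompressing at byte 0
and at the midpoint, which is roughly what a record reader would do if it
were handed the second split of a .gz file. I would expect the midpoint read
to fail, since a gzip stream only carries its header at the very start.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitDemo {

    // Compress a few thousand text lines into one in-memory gzip stream.
    static byte[] gzipSample() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            for (int i = 0; i < 5000; i++) {
                gz.write(("record " + i + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return buf.toByteArray();
    }

    // Returns true if the bytes from `offset` onward can be opened and
    // fully decompressed as gzip, false if any IOException is raised.
    static boolean canReadFrom(byte[] data, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            byte[] chunk = new byte[4096];
            while (in.read(chunk) != -1) {
                // drain the stream; we only care whether decoding succeeds
            }
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzipSample();
        // Simulates a reader given the first "split" (starts at the header).
        System.out.println("from offset 0: " + canReadFrom(data, 0));
        // Simulates a reader given a later "split" (starts mid-stream).
        System.out.println("from midpoint: " + canReadFrom(data, data.length / 2));
    }
}
```

If the midpoint read fails here as well, that would be consistent with the
"incorrect header check" coming from a mapper that was given a split starting
partway into a file. In that case one thing to try would be overriding
isSplitable in my input format so that gzipped paths are never split; I have
not verified that yet.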
On Tue, Oct 23, 2012 at 11:10 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
> My input files are gzipped, and I am using the built-in Java codecs
> successfully to uncompress them in a normal Java run...
> fileIn = fs.open(fsplit.getPath());
> codec = compressionCodecs.getCodec(fsplit.getPath());
> in = new LineReader(codec != null ?
> codec.createInputStream(fileIn) : fileIn, config);
> But when I use the same piece of code in a MR job I am getting...
> 12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
> 12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
> native-zlib library
> 12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
> 12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table
> output configured.
> 12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process
> : 3
> 12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
> 12/10/23 11:02:28 INFO mapred.JobClient: map 0% reduce 0%
> 12/10/23 11:02:49 INFO mapred.JobClient: Task Id :
> attempt_201210221549_0014_m_000003_0, Status : FAILED
> java.io.IOException: incorrect header check
> at java.io.InputStream.read(InputStream.java:101)
> So I am thinking that there is some incompatibility between the native
> zlib library and my gzip files. Is there a way to force Hadoop to use the
> Java built-in compression codecs?
> Also, I would like to try LZO, which I hope will allow splitting of the
> input files (I recall reading this somewhere). Can someone point me to the
> best way to do this?