Do your files carry the .gz extension? They shouldn't have been split if so.
However, if I look at your code, you do not seem to be using the
fileSplit object's offset and length attributes anywhere. So, this
doesn't look like a problem of splits to me.
You're loading up the wrong codec unintentionally. Only .gz
filename suffixes map to GzipCodec if you use the getCodec helper.
If your files are named differently, instantiate the GzipCodec
directly and use that instead of the getCodec helper call.
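To illustrate the difference between picking a decompressor by filename suffix and picking it explicitly, here is a minimal sketch. It assumes nothing about your job: it uses only java.util.zip as a stand-in for Hadoop's codec classes, and the class and method names (GzipBySuffixSketch, openBySuffix, openAsGzip, gzip, firstLine) are made up for the example. Roughly, openBySuffix plays the role of the getCodec lookup and openAsGzip plays the role of a directly instantiated GzipCodec.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipBySuffixSketch {

    // Suffix-driven choice (hypothetical helper): decompress only when the
    // name ends in ".gz", otherwise hand back the raw bytes untouched.
    public static InputStream openBySuffix(InputStream raw, String fileName) throws IOException {
        return fileName.endsWith(".gz") ? new GZIPInputStream(raw) : raw;
    }

    // Explicit choice (hypothetical helper): always treat the stream as gzip,
    // whatever the file happens to be called.
    public static InputStream openAsGzip(InputStream raw) throws IOException {
        return new GZIPInputStream(raw);
    }

    // Demo helper: gzip a string in memory so the example needs no files.
    public static byte[] gzip(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buf)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Demo helper: read the first line of a stream as UTF-8 text.
    public static String firstLine(InputStream in) throws IOException {
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            return r.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("row1\nrow2\n");

        // Name ends in .gz, so the suffix lookup decompresses correctly.
        System.out.println(firstLine(openBySuffix(new ByteArrayInputStream(data), "input.gz")));

        // Explicit gzip works regardless of the file name.
        System.out.println(firstLine(openAsGzip(new ByteArrayInputStream(data))));

        // No .gz suffix: the bytes come back still compressed, so the
        // first "line" is not the original text.
        System.out.println("row1".equals(
                firstLine(openBySuffix(new ByteArrayInputStream(data), "input.dat"))));
    }
}
```

The point of the sketch is only that a suffix-based lookup silently does the wrong thing when the name does not match, while the explicit path does not depend on the name at all.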
On Wed, Oct 24, 2012 at 12:11 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
> Just to follow up on my own question...
> I believe the problem is caused by the input split during MR. So my real
> question is how to handle input splits when the input is gzipped.
> Is it even possible to have splits of a gzipped file?
> On Tue, Oct 23, 2012 at 11:10 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
>> My input files are gzipped, and I am using the builtin java codecs
>> successfully to uncompress them in a normal java run...
>> fileIn = fs.open(fsplit.getPath());
>> codec = compressionCodecs.getCodec(fsplit.getPath());
>> in = new LineReader(codec != null ?
>> codec.createInputStream(fileIn) : fileIn, config);
>> But when I use the same piece of code in a MR job I am getting...
>> 12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop
>> 12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized
>> native-zlib library
>> 12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
>> 12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table
>> output configured.
>> 12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process
>> : 3
>> 12/10/23 11:02:27 INFO mapred.JobClient: Running job:
>> 12/10/23 11:02:28 INFO mapred.JobClient: map 0% reduce 0%
>> 12/10/23 11:02:49 INFO mapred.JobClient: Task Id :
>> attempt_201210221549_0014_m_000003_0, Status : FAILED
>> java.io.IOException: incorrect header check
>> at java.io.InputStream.read(InputStream.java:101)
>> So I am thinking that there is some incompatibility of zlib and my gzip.
>> Is there a way to force hadoop to use the java built-in compression codecs?
>> Also, I would like to try lzo which I hope will allow splitting of the
>> input files (I recall reading this somewhere). Can someone point me to the
>> best way to do this?