Hadoop, mail # user - Re: zlib does not uncompress gzip during MR run


Re: zlib does not uncompress gzip during MR run
Harsh J 2012-10-24, 04:17
Hi,

Do your files carry the .gz extension? If so, they should not have been split.

However, looking at your code, you do not seem to use the
fileSplit object's offset and length attributes anywhere, so this
does not look like a splitting problem to me.

More likely, you are unintentionally loading the wrong codec.
"compressionCodecs.getCodec(fsplit.getPath())" picks a codec by
filename suffix, so only paths ending in .gz map to GzipCodec. If
your files lack that suffix, instantiate the GzipCodec directly and
use it instead of the getCodec helper call.
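The "incorrect header check" in the stack trace below is what zlib reports when it is handed data framed in a format it was not configured for, and you can reproduce it with plain java.util.zip, outside Hadoop entirely. A minimal sketch (the class and method names are made up for illustration): gzip-framed bytes fail a raw zlib Inflater's header check, while a gzip-aware stream decodes the same bytes fine.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.Inflater;
import java.util.zip.InflaterInputStream;

public class HeaderCheckDemo {

    // Compress a string with gzip framing (magic bytes 0x1f 0x8b).
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
            gz.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return buf.toByteArray();
    }

    // Feed the gzip bytes to a raw zlib Inflater, which expects a
    // zlib header instead; return the resulting error message.
    static String zlibError(byte[] gzipped) {
        try (InflaterInputStream in = new InflaterInputStream(
                new ByteArrayInputStream(gzipped), new Inflater())) {
            in.read(new byte[16]);
            return "no error";
        } catch (IOException e) {
            return e.getMessage(); // zlib's complaint about the gzip magic bytes
        }
    }

    // Decode the same bytes with a gzip-aware stream.
    static String gunzip(byte[] gzipped) throws IOException {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gzipped))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] b = new byte[64];
            int n;
            while ((n = in.read(b)) > 0) out.write(b, 0, n);
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = gzip("hello hadoop");
        System.out.println("zlib: " + zlibError(data));
        System.out.println("gzip: " + gunzip(data));
    }
}
```

Hadoop's ZlibDecompressor hits the same wall: picking it (or any non-gzip codec) for a gzip file means the very first header bytes fail validation, which is why forcing the right codec choice matters more than which MR settings are in play.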

On Wed, Oct 24, 2012 at 12:11 AM, Jonathan Bishop <[EMAIL PROTECTED]> wrote:
> Just to follow up on my own question...
>
> I believe the problem is caused by the input split during MR. So my real
> question is how to handle input splits when the input is gzipped.
>
> Is it even possible to have splits of a gzipped file?
>
> Thanks,
>
> Jon
>
>
> On Tue, Oct 23, 2012 at 11:10 AM, Jonathan Bishop <[EMAIL PROTECTED]>
> wrote:
>>
>> Hi,
>>
>> My input files are gzipped, and I am using the builtin java codecs
>> successfully to uncompress them in a normal java run...
>>
>>         fileIn = fs.open(fsplit.getPath());
>>         codec = compressionCodecs.getCodec(fsplit.getPath());
>>         in = new LineReader(codec != null
>>                 ? codec.createInputStream(fileIn) : fileIn, config);
>>
>> But when I use the same piece of code in a MR job I am getting...
>>
>> 12/10/23 11:02:25 INFO util.NativeCodeLoader: Loaded the native-hadoop library
>> 12/10/23 11:02:25 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
>> 12/10/23 11:02:25 INFO compress.CodecPool: Got brand-new compressor
>> 12/10/23 11:02:26 INFO mapreduce.HFileOutputFormat: Incremental table output configured.
>> 12/10/23 11:02:26 INFO input.FileInputFormat: Total input paths to process : 3
>> 12/10/23 11:02:27 INFO mapred.JobClient: Running job: job_201210221549_0014
>> 12/10/23 11:02:28 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/10/23 11:02:49 INFO mapred.JobClient: Task Id : attempt_201210221549_0014_m_000003_0, Status : FAILED
>> java.io.IOException: incorrect header check
>>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.inflateBytesDirect(Native Method)
>>     at org.apache.hadoop.io.compress.zlib.ZlibDecompressor.decompress(ZlibDecompressor.java:221)
>>     at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:82)
>>     at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:76)
>>     at java.io.InputStream.read(InputStream.java:101)
>>
>> So I am thinking there is some incompatibility between zlib and my gzip
>> files. Is there a way to force hadoop to use the java built-in compression
>> codecs?
>>
>> Also, I would like to try lzo which I hope will allow splitting of the
>> input files (I recall reading this somewhere). Can someone point me to the
>> best way to do this?
>>
>> Thanks,
>>
>> Jon
>
>

--
Harsh J