Making gzip splittable for Hadoop

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly 1 map task per input file,
because a gzip stream cannot be split. In many scenarios this is fine.
However, when doing a lot of work in this very first map task it may very
well be advantageous to divide the work over multiple tasks, even if there
is a penalty for this scaling out.
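
To make the trade-off concrete, here is a minimal Java sketch of the general idea. It is not the addon's actual code. A gzip stream can only be decoded from its first byte, so a task that wants a later part of the file has to decompress and throw away everything before it. The class and method names (GzipSplitSketch, readSplit) are made up for this illustration, and for simplicity the split boundaries are expressed as offsets in the uncompressed data, which is an assumption of the sketch and not how Hadoop defines its input splits.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; not part of Hadoop or of the addon.
public class GzipSplitSketch {

    /** Compress a byte array into a single gzip stream. */
    public static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    /**
     * Return the uncompressed bytes in [start, end). The stream is always
     * decoded from the beginning; the bytes before 'start' are decompressed
     * and discarded. That wasted work is the penalty of scaling out.
     */
    public static byte[] readSplit(byte[] gz, long start, long end) {
        byte[] buf = new byte[4096];
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        try (InputStream in = new GZIPInputStream(new ByteArrayInputStream(gz))) {
            // Phase 1: decompress and drop everything before this split.
            long toSkip = start;
            while (toSkip > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, toSkip));
                if (n < 0) {
                    return result.toByteArray(); // split starts past end of data
                }
                toSkip -= n;
            }
            // Phase 2: keep the bytes that belong to this split.
            long remaining = end - start;
            while (remaining > 0) {
                int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
                if (n < 0) {
                    break; // end of data reached before 'end'
                }
                result.write(buf, 0, n);
                remaining -= n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result.toByteArray();
    }

    public static void main(String[] args) {
        // Eight 6-byte lines, divided over four "tasks" of 12 bytes each.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 8; i++) {
            sb.append("line").append(i).append('\n');
        }
        byte[] gz = gzip(sb.toString().getBytes(StandardCharsets.UTF_8));
        for (int task = 0; task < 4; task++) {
            byte[] part = readSplit(gz, task * 12L, (task + 1) * 12L);
            System.out.print("task " + task + ":\n"
                    + new String(part, StandardCharsets.UTF_8));
        }
    }
}
```

In this sketch the last task decompresses the whole file to obtain only its final quarter, so the total decompression work grows with the number of splits. That is only worth it when the per-record work in the mapper clearly outweighs the cost of decompressing.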

I've created an addon for Hadoop that makes this possible.

I've reworked the patch I initially created to be included in Hadoop (see
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.

I put the code on GitHub ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

Best regards / Met vriendelijke groeten,

Niels Basjes