In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly one map task per input file,
because gzip is not a splittable compression format. In many scenarios this
is fine. However, when a lot of work is done in this very first map task, it
may very well be advantageous to divide the work over multiple tasks, even
if there is a penalty for this scaling out.
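To illustrate the underlying limitation, here is a small plain-Python sketch (not Hadoop code): a gzip stream can only be decompressed starting from its header, which is why Hadoop normally has to hand the whole file to a single map task.

```python
# Plain-Python illustration of why gzip files are not splittable:
# decompression has to start at the gzip header at byte 0.
import gzip

data = gzip.compress(b"line1\nline2\nline3\n")

# Decompressing from the start of the stream works fine.
assert gzip.decompress(data) == b"line1\nline2\nline3\n"

# Starting at an arbitrary byte offset fails: there is no valid
# gzip header at that position in the stream.
try:
    gzip.decompress(data[10:])
    mid_stream_ok = True
except OSError:  # gzip.BadGzipFile is a subclass of OSError
    mid_stream_ok = False
```

A splittable format would let a task seek to an offset and start reading there; with gzip that is impossible without help, which is exactly the gap this addon works around.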
I've created an addon for Hadoop that makes this possible.
I've reworked the patch I initially created to be included in Hadoop (see
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
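As a sketch of what usage looks like: assuming the codec class is named nl.basjes.hadoop.io.compress.SplittableGzipCodec (please check the README on github for the exact class name and the available tuning properties), you would register it in the Hadoop configuration, for example:

```xml
<!-- Hypothetical configuration fragment (e.g. core-site.xml or per-job
     settings); the exact class name and properties are in the README. -->
<property>
  <name>io.compression.codecs</name>
  <value>nl.basjes.hadoop.io.compress.SplittableGzipCodec</value>
</property>
```

With the jar on the classpath and the codec registered, input formats that honor the codec's splittability can then schedule multiple map tasks over a single gzipped file.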
I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:
This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823).
Running "mvn package" automatically generates an RPM on my CentOS system.
Have fun with it and let me know what you think.
Best regards / Met vriendelijke groeten,