Making gzip splittable for Hadoop
Hi,

In many Hadoop production environments you get gzipped files as the raw
input. Usually these are Apache HTTPD logfiles. When putting these gzipped
files into Hadoop you are stuck with exactly 1 map task per input file. In
many scenarios this is fine. However, when a lot of work is done in this very
first map task, it may very well be advantageous to divide the work over
multiple tasks, even if there is a penalty for this scaling out.

I've created an addon for Hadoop that makes this possible.

I've reworked the patch I initially created for inclusion in Hadoop (see
HADOOP-7076).
It can now be used by simply adding a jar file to the classpath of an
existing Hadoop installation.
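
To give an idea of what that looks like, here is a minimal (untested in this
exact form) driver sketch that registers the codec and caps the split size so
one gzipped file gets read by several mappers. The codec class name
nl.basjes.hadoop.io.compress.SplittableGzipCodec is an assumption on my part;
please check the project page linked below for the exact class name and
recommended settings.

// Minimal sketch: register the splittable gzip codec and force smaller splits.
// The codec class name below is an assumption; see the project page for details.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SplittableGzipDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Put the splittable codec in front of the default codecs
    // (assumed class name, see the project page below).
    conf.set("io.compression.codecs",
        "nl.basjes.hadoop.io.compress.SplittableGzipCodec,"
      + "org.apache.hadoop.io.compress.DefaultCodec");

    // On older releases use "new Job(conf, ...)" instead of Job.getInstance().
    Job job = Job.getInstance(conf, "splittable-gzip-demo");
    job.setJarByClass(SplittableGzipDemo.class);
    job.setMapperClass(Mapper.class);   // identity mapper, map-only job
    job.setNumReduceTasks(0);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);

    // Cap the split size so a single gzipped file is cut into several splits.
    FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it with the splittablegzip jar on the classpath of the existing
installation, e.g. via HADOOP_CLASSPATH or -libjars.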

I put the code on github ( https://github.com/nielsbasjes/splittablegzip )
and (for now) the description on my homepage:
http://niels.basjes.nl/splittable-gzip

This feature only works with Hadoop 0.21 and up (I tested it with Cloudera
CDH4b1).
So, for now, Hadoop 1.x is not yet supported (waiting for HADOOP-7823).

Running "mvn package" automatically generates an RPM on my CentOS system.

Have fun with it and let me know what you think.

--
Best regards / Met vriendelijke groeten,

Niels Basjes