Pig >> mail # user >> [Announce] Parsing Apache HTTPD logfiles made easy

[Announce] Parsing Apache HTTPD logfiles made easy

I've been working on a hobby project of mine and recently decided to
open-source it.

It takes the LogFormat specification directly from the httpd config that
was used to write the file and constructs a parser that does the inverse
(parsing) operation. So instead of reinventing the most horrendous regular
expression ever when someone decides to change the LogFormat, you can now
simply copy the new LogFormat specification and continue with minimal
downtime.
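
As a rough illustration of the idea (this is not the project's code), the
Python sketch below turns a small subset of LogFormat directives into a
regular expression and parses one line with it. The directive table, the
field names and the helper build_parser are all hypothetical, and header
directives such as %{Referer}i are deliberately left out.

```python
import re

# Hypothetical mapping from a few LogFormat directives to
# (field name, regex fragment). A real parser covers far more directives.
DIRECTIVES = {
    "%h": ("host", r"\S+"),
    "%l": ("logname", r"\S+"),
    "%u": ("user", r"\S+"),
    "%t": ("time", r"\[[^\]]+\]"),
    "%r": ("request", r"[^\"]*"),
    "%>s": ("status", r"\d{3}"),
    "%b": ("bytes", r"\d+|-"),
}

# Matches simple directives such as %h or %>s (no %{...}x forms).
TOKEN = re.compile(r"%>?[a-zA-Z]")


def build_parser(log_format):
    """Compile a LogFormat string into a regex with named groups.

    Raises KeyError for directives missing from DIRECTIVES.
    """
    pattern = []
    pos = 0
    for match in TOKEN.finditer(log_format):
        # Literal text between directives must match exactly.
        pattern.append(re.escape(log_format[pos:match.start()]))
        name, fragment = DIRECTIVES[match.group()]
        pattern.append("(?P<%s>%s)" % (name, fragment))
        pos = match.end()
    pattern.append(re.escape(log_format[pos:]))
    return re.compile("^" + "".join(pattern) + "$")


parser = build_parser('%h %l %u %t "%r" %>s %b')
line = ('127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
        '"GET /index.html HTTP/1.0" 200 2326')
record = parser.match(line).groupdict()
print(record["host"], record["status"], record["request"])
```

When the LogFormat changes, only the string passed to build_parser changes;
the regular expression is rebuilt from it instead of being edited by hand.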

The codebase includes:
- a POJO/annotation-based parser that can be used in regular Java
- a Hadoop InputFormat for use in Java MapReduce applications
- a Pig Loader.

As said, this is a hobby project of mine, and I'm using it (mostly the Pig
Loader) for several of my own experiments.

Because I think it's "the best thing since sliced bread" I've become kind
of blind to its flaws.
So I would really appreciate it if you guys could let me know what you think.


Niels Basjes

Quote from the readme:

Usage (PIG)
You simply register the httpdlog-pigloader-1.0-SNAPSHOT-job.jar:

REGISTER target/httpdlog-pigloader-1.0-SNAPSHOT-job.jar

And then call the loader with a dummy file (must exist, won't be read) and
the parameter called 'fields'. This will return a list of all possible
fields. Note that where a '*' appears, many different values can appear
there (for example the keys of a query string in a URL). As you can see,
there is a kind of sloppy type mechanism to steer the parsing; don't change
that, as the parsing really relies on it.
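
The '*' entries exist because some field names only become known once the
data is read. A minimal Python sketch of that point, not the loader's
implementation: the helper query_fields and the 'query.' prefix are
made-up names used purely for illustration.

```python
from urllib.parse import urlsplit, parse_qsl


def query_fields(uri, prefix="query."):
    """Return one field per query-string key found in the URI."""
    query = urlsplit(uri).query
    return {prefix + key: value for key, value in parse_qsl(query)}


# The available keys differ per request, so they cannot be enumerated
# from the LogFormat alone; a wildcard stands in for them.
print(query_fields("/search?q=pig&page=2"))
print(query_fields("/search?lang=nl"))
```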

Fields = LOAD 'test.pig' -- Any file as long as it exists
  USING nl.basjes.pig.input.apachehttpdlog.Loader(
    '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"',
    'Fields' ) AS (fields);

DUMP Fields;

Now that we have all the possible values that CAN be produced from this
logformat, we simply choose the ones we need and tell the Loader we want
them.
Clicks = LOAD 'access_log.gz'
  USING nl.basjes.pig.input.apachehttpdlog.Loader(
    '%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"',

    AS (
From here you can do as you want with the resulting tuples. Note that
almost everything is output as a chararray, yet things that seem like
numbers (based on the sloppy typing) are output as longs.
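
The effect of that typing can be pictured with the small Python sketch
below. It assumes, purely for illustration, that a field is written as
'TYPE:name' and that the type labels BYTES and NUMBER mark numeric values;
those labels, the field names and the helper convert are hypothetical and
not taken from the loader.

```python
# Hypothetical type labels that are treated as numeric.
NUMERIC_TYPES = {"BYTES", "NUMBER"}


def convert(typed_field, raw_value):
    """Return an int for numeric-looking types, else the original string."""
    type_label, _, _name = typed_field.partition(":")
    if type_label in NUMERIC_TYPES and raw_value.isdigit():
        return int(raw_value)
    return raw_value


print(convert("BYTES:response.bytes", "2326"))
print(convert("STRING:request.method", "GET"))
```

A value such as "-" under a numeric type stays a string in this sketch,
because it fails the isdigit() check.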
Best regards / Met vriendelijke groeten,

Niels Basjes