Flume, mail # dev - New Features Proposed for Apache Flume

New Features Proposed for Apache Flume
Israel Ekpo 2013-08-28, 16:06
Hello everyone,

I think it will be helpful to have the following features in Apache Flume:

I plan to open JIRA issues for these proposals tonight.

I am about to start creating patches for some of them but I would like to
know what you guys think so that I can tweak my logic accordingly without
going too far.

*When you get a chance, please take a look at them and give me some


Using the Tailer feature from Apache Commons I/O utility [1], we can tail
specific files for events.

This lets us watch files for new events as they occur, regardless of the
operating system.

It also allows us to step in and decide whether two or more lines should be
merged into one event when newline characters are present in an event.

We can configure regular expressions that determine whether a specific
line starts a new event or belongs to the previous event.

Essentially, this source will have the ability to merge multiple lines into
one event before it is passed on to interceptors.
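
To make the merge rule concrete, here is a minimal sketch in plain Java. It
is only an illustration of the idea, not a patch: the class name
MultilineMerger and the "event start" regular expression are hypothetical,
and the tailing itself (Commons I/O Tailer) is left out so that the example
only shows how lines could be grouped into events.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Flume or Commons I/O.
class MultilineMerger {
    // A line whose beginning matches this pattern opens a new event.
    private final Pattern eventStart;

    MultilineMerger(String eventStartRegex) {
        this.eventStart = Pattern.compile(eventStartRegex);
    }

    // Groups lines into events. Any line that does not match eventStart
    // is appended to the event that is currently being built.
    List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            boolean startsEvent = eventStart.matcher(line).lookingAt();
            if (current == null || startsEvent) {
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

With a pattern such as "\\d{4}-\\d{2}-\\d{2} ", a stack trace that follows a
timestamped log line would end up in the same event as that line. A real
source that tails a live file would also need a timeout or similar rule to
flush the last event, since it cannot know whether more continuation lines
are coming.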

It has been complicated to group multiple lines into a single event with the
Spooling Directory Source or Exec Source. I tried creating custom
deserializers but it was hard to get around the logic used to parse the

Using the Spooling Directory Source also means we cannot watch the original
files, so we need a background process to copy the log files into the
spooling directory, which requires additional setup.

The tail command is also not available on all operating systems out of the
box.

With this interceptor we can parse semi-structured and unstructured text and
log data in the headers and body of the event into something structured
that can be easily queried.
I plan to use the information in [2] and [3] for this.
With this interceptor, we can extract HTTP response codes, response times,
user agents, IP addresses and a whole bunch of other useful data points from
free-form text.
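
As a rough illustration of the kind of extraction meant here, the sketch
below uses plain java.util.regex named groups rather than the java-grok
library from [2], whose API is not shown. The class name, the field names
and the simplified log line layout ("<ip> <status> <millis>ms <user agent>")
are all assumptions made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical extractor for illustration; not the java-grok API.
class AccessLogExtractor {
    // Assumed line layout: "<ip> <status> <millis>ms <user agent>".
    private static final Pattern LINE = Pattern.compile(
        "(?<clientip>\\d{1,3}(?:\\.\\d{1,3}){3}) "
        + "(?<status>\\d{3}) "
        + "(?<millis>\\d+)ms "
        + "(?<agent>.+)");

    private static final String[] FIELDS =
        {"clientip", "status", "millis", "agent"};

    // Returns the extracted fields, or an empty map if the body does not
    // match. An interceptor could copy these into the event headers.
    static Map<String, String> extract(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(body);
        if (m.find()) {
            for (String field : FIELDS) {
                fields.put(field, m.group(field));
            }
        }
        return fields;
    }
}
```

Grok patterns build on the same idea, with a library of reusable named
patterns in place of one hand-written expression.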


This is for IP intelligence.

This interceptor will allow us to use the value of an IP address in the
event header or body of the request to estimate the geographical location
of the IP address.

Using the database available here [4], we can inject the two-letter country
code or the country name for the IP address into the event.

We can also deduce other values such as city name, postalCode, latitude,
longitude, Internet Service Provider and Organization name.

This can be very helpful in analyzing traffic patterns and target audience
from webserver or application logs.

The database is loaded into a Lucene index when the agent starts up.
The index is created only once, if it does not already exist.

As the interceptor comes across events, it maps the IP address to a variety
of values that can be injected into the events.
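
The lookup step can be sketched as follows. Note that this is not the
Lucene-based design described above: to keep the example self-contained, an
in-memory java.util.TreeMap stands in for the index, and the class name
IpCountryLookup, the ranges and the country codes used with it are
hypothetical. The only assumption about the data is that each row gives a
start address, an end address and a country code for an IPv4 range.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory lookup; a TreeMap stands in for the Lucene index.
class IpCountryLookup {
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Keyed by the numeric start address of each range.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    void addRange(String startIp, String endIp, String countryCode) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), countryCode));
    }

    // Returns the country code for the address, or null if no range
    // contains it.
    String countryOf(String ip) {
        long value = toLong(ip);
        Map.Entry<Long, Range> entry = ranges.floorEntry(value);
        if (entry != null && value <= entry.getValue().end) {
            return entry.getValue().country;
        }
        return null;
    }

    // Converts a dotted-quad IPv4 address to its 32-bit numeric value.
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

An interceptor built along these lines would read the IP address from a
header or from the body, call something like countryOf, and add the result
to the event as a new header.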


This can provide another option for setting up a fan-in and/or fan-out

The RedisSink can serve as a queue that is used as a source by another
agent down the line.

[2] https://github.com/NFLabs/java-grok
[3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
[4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
[5] http://dev.maxmind.com/geoip/legacy/csv/
[6] http://redis.io/documentation
[7] https://github.com/xetorthio/jedis

*Author and Instructor for the Upcoming Book and Lecture Series*
*Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software*