Flume >> mail # dev >> New Features Proposed for Apache Flume


Israel Ekpo 2013-08-28, 16:06
Re: New Features Proposed for Apache Flume
Re: GrokInterceptor

This functionality is already available in the form of the Apache Flume MorphlineInterceptor [1] with the grok command [2]. While grok is very useful, consider that grok alone often isn't enough - you typically need some other log event processing commands as well, for example as contained in morphlines [3].
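
To illustrate the kind of field extraction a grok-style command performs, here is a minimal Python sketch using named regular-expression groups. It is not the morphline grok command or its pattern library; the log layout, the pattern, and the names `ACCESS_LINE` and `extract_fields` are assumptions made up for this example.

```python
import re

# Hypothetical pattern for illustration only; it is not shipped with grok or morphlines.
# Assumed line layout: <client ip> "<METHOD> <path>" <status> <response time in ms>
ACCESS_LINE = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) (?P<response_ms>\d+)'
)


def extract_fields(line):
    """Return the named fields found in line, or an empty dict if nothing matches."""
    match = ACCESS_LINE.search(line)
    return match.groupdict() if match else {}


if __name__ == "__main__":
    print(extract_fields('10.0.0.7 "GET /index.html" 200 35'))
```

Each named group becomes a structured field, which is roughly what ends up in the event headers or record after a grok step.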

Re: FileSource

True file tailing would be great.

Merging multiple lines into one event can already be done with the MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a morphline directly into that new FileSource?
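
As a rough sketch of the multi-line merging idea (not the readMultiLine implementation), the Python below appends any line that matches a continuation pattern to the previous event. The continuation rule (indented lines and "Caused by:" lines) and the names `CONTINUATION` and `merge_multiline` are assumptions chosen for this example.

```python
import re

# Assumed continuation rule: indented lines and "Caused by:" lines
# belong to the previous event (typical for Java stack traces).
CONTINUATION = re.compile(r'^(\s+|Caused by:)')


def merge_multiline(lines, continuation=CONTINUATION):
    """Group raw lines into events, folding continuation lines into the previous event."""
    events = []
    for line in lines:
        if events and continuation.match(line):
            # Continuation line: attach it to the event that is being built.
            events[-1] += "\n" + line
        else:
            # Any other line starts a new event.
            events.append(line)
    return events


if __name__ == "__main__":
    sample = [
        "ERROR request failed",
        "    at Foo.bar(Foo.java:1)",
        "Caused by: java.io.IOException",
        "INFO request ok",
    ]
    for event in merge_multiline(sample):
        print(repr(event))
```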

Re: GeoIPInterceptor

Seems to me that it would be more flexible, powerful and reusable to add this kind of functionality as a morphline command - contributions welcome!

Finally, a word of caution: MaxMind is a good geo database, and I've used it before, but it has some LGPL issues that may or may not be workable in this context. Also, the MaxMind database fits into RAM, so Lucene seems like overkill here - you can do fast MaxMind lookups directly without Lucene.
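
To show why an in-memory lookup can be enough, here is a minimal Python sketch of a range lookup over a sorted list using binary search. It does not use the MaxMind API or file format; the ranges are documentation-only networks, and the labels "AA", "BB", "CC" and the names `RANGES`, `STARTS` and `lookup` are made up for this example.

```python
import bisect
import ipaddress

# Made-up sample data: documentation-only networks with placeholder labels.
# A real geo database has far more rows, but the lookup idea is the same.
RANGES = sorted(
    (int(ipaddress.ip_network(cidr)[0]), int(ipaddress.ip_network(cidr)[-1]), label)
    for cidr, label in [
        ("192.0.2.0/24", "AA"),
        ("198.51.100.0/24", "BB"),
        ("203.0.113.0/24", "CC"),
    ]
)
# Range start addresses, kept separately so bisect can search them.
STARTS = [start for start, _, _ in RANGES]


def lookup(ip):
    """Return the label of the range containing ip, or None if no range matches."""
    value = int(ipaddress.ip_address(ip))
    # Find the last range whose start is <= value.
    index = bisect.bisect_right(STARTS, value) - 1
    if index >= 0 and value <= RANGES[index][1]:
        return RANGES[index][2]
    return None


if __name__ == "__main__":
    print(lookup("198.51.100.20"))
    print(lookup("10.0.0.1"))
```

Each lookup is a binary search over a sorted in-memory list, so no separate index is needed as long as the table fits in RAM.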

[1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
[2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
[3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
[4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine

Wolfgang.

>
> *FileSource*
>
> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> specific files for events.
>
> This allows us to, regardless of the operating system, have the ability to
> watch files for future events as they occur.
>
> It also allows us to step in and determine if two or more events should be
> merged into one event if newline characters are present in an event.
>
> We can configure certain regular expressions that determine if a specific
> line is a new event or part of the previous event.
>
> Essentially, this source will have the ability to merge multiple lines into
> one event before it is passed on to interceptors.
>
> It has been complicated to group multiple lines into a single event with the
> Spooling Directory Source or Exec Source. I tried creating custom
> deserializers but it was hard to get around the logic used to parse the
> files.
>
> Using the Spooling Directory also means we cannot watch the original files
> so we need a background process to copy over the log files into the
> spooling directory which requires additional setup.
>
> The tail command is also not available on all operating systems out of the
> box.
>
>
> *GrokInterceptor*
>
> With this interceptor we can parse semi-structured and unstructured text and
> log data in the headers and body of the event into something structured
> that can be easily queried.
> I plan to use the information [2] and [3] for this.
> With this interceptor, we can extract HTTP response codes, response times,
> user agents, IP addresses and a whole bunch of useful data points from free
> form text.
>
>
>
> *GeoIPInterceptor*
>
> This is for IP intelligence.
>
> This interceptor will allow us to use the value of an IP address in the
> event header or body of the request to estimate the geographical location
> of the IP address.
>
> Using the database available here [4], we can inject the two-letter country
> code or country name for the IP address into the event.
>
> We can also deduce other values such as city name, postalCode, latitude,
> longitude, Internet Service Provider and Organization name.
>
> This can be very helpful in analyzing traffic patterns and target audience
> from webserver or application logs.
>
> The database is loaded into a Lucene index when the agent is started up.
> The index is only created once if it does not already exist.
>
> As the interceptor comes across events, it maps the IP address to a variety
> of values that can be injected into the events.
>
>
>
> *RedisSink*
>
> This can provide another option for setting up a fan-in and/or fan-out
> architecture.