Re: New Features Proposed for Apache Flume
Thank you everyone for your very constructive feedback. They were very

To provide some background, most of these suggestions have been inspired by
features I have found in Logstash [3].

I am going to spend more time understanding how the cdk morphline commands
[4] work because I think it will really help with the transformation utils
needed in FileSource.

Regarding the GrokInterceptor, I was not aware of the existence of
MorphlineInterceptor. It already does what I was proposing with
GrokInterceptor. So we are cool from that end.

In simple standalone tests, the commons-io class that I am planning to use
for the FileSource handles file rotations well but I have not tested
renames or removals yet.
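
For illustration, one way to check rotation handling in a standalone test is
sketched below with plain JDK classes. This is not the commons-io Tailer and
not Flume code: the class name RotationAwareReader and its poll() method are
hypothetical, and the rotation check (file shorter than the last read offset)
is an assumption about how rotation shows up on disk.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Flume or commons-io.
final class RotationAwareReader {
    private final Path path;
    private long position = 0L; // byte offset just past the last line returned

    RotationAwareReader(Path path) {
        this.path = path;
    }

    /**
     * Returns the lines appended since the previous call.
     * A trailing line without a newline is returned as-is.
     */
    List<String> poll() throws IOException {
        List<String> lines = new ArrayList<>();
        if (!Files.exists(path)) {
            // File removed or in the middle of a rename: restart from the top later.
            position = 0L;
            return lines;
        }
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            if (file.length() < position) {
                // Shorter than our offset: assume the file was rotated or truncated.
                position = 0L;
            }
            file.seek(position);
            String line;
            while ((line = file.readLine()) != null) {
                // readLine maps each byte to one char, so re-decode the bytes as UTF-8.
                lines.add(new String(line.getBytes(StandardCharsets.ISO_8859_1),
                                     StandardCharsets.UTF_8));
                position = file.getFilePointer();
            }
        }
        return lines;
    }
}
```

Note that a length check like this misses a rotation where the new file has
already grown past the old offset, which is one reason renames and removals
need their own tests.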

Regarding the GeoIPInterceptor, we can provide links for downloading the
Maxmind database separately without bundling the IP database with Flume.

This is how the Logstash project does it.

Because of the large number of events expected, I was planning to use
Lucene: trie indexing [5] makes range queries fast, and the results of
previously executed queries can also be cached in memory.

I can perform some benchmarks with and without Lucene and see if the
performance differences justify using it for the lookups.

My gut feeling is that using Lucene will lead to shorter processing times
as the volume of events increases.
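
As a possible "without Lucene" baseline for such a benchmark, an in-memory
range lookup could look roughly like the sketch below. It assumes IPv4 only
and non-overlapping ranges; the class IpRangeTable, its methods, and the
labels are hypothetical and do not reflect the Maxmind API or database
format. The addresses used are reserved documentation ranges.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory range table for illustration; not the Maxmind API.
final class IpRangeTable {
    private static final class Range {
        final long end;
        final String label;

        Range(long end, String label) {
            this.end = end;
            this.label = label;
        }
    }

    // Ranges keyed by their start address; assumes ranges do not overlap.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    /** Parses a dotted-quad IPv4 address into an unsigned 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0L;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    void add(String startIp, String endIp, String label) {
        byStart.put(toLong(startIp), new Range(toLong(endIp), label));
    }

    /** Returns the label of the range containing ip, or null if none matches. */
    String lookup(String ip) {
        long key = toLong(ip);
        // Greatest range start <= key, found in O(log n).
        Map.Entry<Long, Range> entry = byStart.floorEntry(key);
        if (entry != null && key <= entry.getValue().end) {
            return entry.getValue().label;
        }
        return null;
    }
}
```

For example, after add("192.0.2.0", "192.0.2.255", "test-net-1"), a call to
lookup("192.0.2.17") returns "test-net-1" and lookup("192.0.3.1") returns
null.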

The RedisSource and RedisSink features will just be simple sources and
sinks. The sink will push [1] events to the Redis server and the source
will do a blocking pop [2] as it waits for new events to occur on the Redis

I am still trying out a few things; this part is not yet finalized.
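
The push [1] and blocking pop [2] pattern described above can be modeled
without a Redis server as follows. This is only an in-memory stand-in built
on a JDK deque, assuming events are plain strings; a real sink and source
would go through a Redis client library. The class InMemoryEventList and its
method names are hypothetical and simply mirror the Redis command names.

```java
import java.util.concurrent.BlockingDeque;
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Hypothetical in-memory stand-in for a Redis list; for illustration only.
final class InMemoryEventList {
    private final BlockingDeque<String> list = new LinkedBlockingDeque<>();

    /** Sink side: append an event to the tail of the list, as RPUSH does. */
    void rpush(String event) {
        list.addLast(event);
    }

    /**
     * Source side: remove the event at the head of the list, waiting up to
     * timeoutMillis for one to arrive, similar to BLPOP with a timeout.
     * Returns null if nothing arrived in time.
     */
    String blpop(long timeoutMillis) throws InterruptedException {
        return list.pollFirst(timeoutMillis, TimeUnit.MILLISECONDS);
    }
}
```

Pushing at the tail and popping at the head keeps events in arrival order,
and the blocking pop lets the source wait for new events instead of polling
in a tight loop.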

Regarding contributing features as plugins, how are plugins typically
contributed and managed?

Do I have to create a GitHub repo and manage it independently, or are they
contributed as patches to the Flume project?

[1] http://redis.io/commands/rpush
[2] http://redis.io/commands/blpop
[3] http://logstash.net/docs/1.2.1/
[4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html

*Author and Instructor for the Upcoming Book and Lecture Series*
*Massive Log Data Aggregation, Processing, Searching and Visualization with
Open Source Software*
On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

> Re: GrokInterceptor
> This functionality is already available in the form of the Apache Flume
> MorphlineInterceptor [1] with the grok command [2]. While grok is very
> useful, consider that grok alone often isn't enough - you typically need
> some other log event processing commands as well, for example as contained
> in morphlines [3].
> Re: FileSource
> True file tailing would be great.
> Merging multiple lines into one event can already be done with the
> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
> morphline directly into that new FileSource?
> Re: GeoIPInterceptor
> Seems to me that it would be more flexible, powerful and reusable to add
> this kind of functionality as a morphline command - contributions welcome!
> Finally, a word of caution, Maxmind is a good geo db, and I've used it
> before, but it has some LGPL issues that may or may not be workable in this
> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
> can do fast maxmind lookups directly without Lucene.
> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
> Wolfgang.
> >
> > *FileSource*
> >
> > Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> > specific files for events.