Re: New Features Proposed for Apache Flume
Wolfgang Hoschek 2013-11-16, 10:26
FYI, I've just added a new morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup - https://issues.cloudera.org/browse/CDK-227
This can then be used in the MorphlineInterceptor or Morphline Sink.
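
To illustrate why an in-memory lookup can be enough here, below is a minimal sketch of an IPv4 range lookup backed by a sorted map. It is not the code behind the morphline command or the Maxmind library; the class name `IpRangeLookup`, its methods, and the location labels are all hypothetical, and it assumes the address blocks do not overlap.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

// Hypothetical illustration only: not the Maxmind or morphline implementation.
final class IpRangeLookup {

    // One contiguous IPv4 block: its last address and a location label.
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by the first address of each block; blocks are assumed not to overlap.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    void add(String firstIp, String lastIp, String location) {
        ranges.put(toLong(firstIp), new Range(toLong(lastIp), location));
    }

    // O(log n): find the block starting at or before the address, then check its end.
    Optional<String> lookup(String ip) {
        long addr = toLong(ip);
        Map.Entry<Long, Range> entry = ranges.floorEntry(addr);
        if (entry == null || addr > entry.getValue().end) {
            return Optional.empty();
        }
        return Optional.of(entry.getValue().location);
    }

    // Converts dotted-quad IPv4 text into an unsigned 32-bit value held in a long.
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("octet out of range in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

Each lookup is a single `floorEntry` call on a map held entirely in RAM, so no separate index or query engine is involved.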
On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:
> Thank you everyone for your very constructive feedback. They were very
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash.
> I am going to spend more time understanding how the CDK morphline commands
> work because I think it will really help with the transformation utils
> needed in FileSource.
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
> Regarding the GeoIPInterceptor, we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume.
> This is how the Logstash project does it.
> Because of the large number of events expected, I was planning to use
> Lucene because its trie indexing makes range queries fast,
> and the results can also be cached in memory if they have been
> previously executed.
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push events to the Redis server and the source
> will do a blocking pop as it waits for new events to occur on the Redis
> I am still trying out a few things; this part is not yet finalized.
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
> Do I have to create a GitHub repo and manage it independently, or are they
> contributed as patches to the Flume project?
> http://redis.io/commands/rpush
> http://redis.io/commands/blpop
> http://logstash.net/docs/1.2.1/
> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> Re: GrokInterceptor
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor with the grok command. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines.
>> Re: FileSource
>> True file tailing would be great.
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command. Or maybe embed a
>> morphline directly into that new FileSource?
>> Re: GeoIPInterceptor
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>> Finally, a word of caution: Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. The Maxmind db fits into RAM, so Lucene seems like overkill here;
>> you can do fast Maxmind lookups directly without Lucene.
>> http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
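
For the RedisSink and RedisSource idea discussed in this thread (push on one side, blocking pop on the other), here is a minimal local sketch of the queue semantics only. It does not talk to Redis and is not proposed Flume code; the class `ListQueueModel` and its methods are hypothetical stand-ins that mimic the RPUSH and BLPOP commands with a JDK deque, assuming a single list and a finite timeout.

```java
import java.util.concurrent.LinkedBlockingDeque;
import java.util.concurrent.TimeUnit;

// Hypothetical in-process model of a Redis list used as an event queue.
final class ListQueueModel {

    private final LinkedBlockingDeque<String> list = new LinkedBlockingDeque<>();

    // Mimics RPUSH: the sink appends each event to the tail of the list.
    void rpush(String event) {
        list.addLast(event);
    }

    // Mimics BLPOP with a finite timeout: the source removes the head of the
    // list, waiting up to the timeout for an event to arrive.
    // Returns null if nothing arrived in time or the thread was interrupted.
    String blpop(long timeout, TimeUnit unit) {
        try {
            return list.pollFirst(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
    }
}
```

Pushing to the tail and popping from the head gives first-in, first-out delivery, which is the ordering a source would usually want when draining events written by a sink.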