Flume, mail # dev - New Features Proposed for Apache Flume

Re: New Features Proposed for Apache Flume
Wolfgang Hoschek 2013-11-16, 10:26
FYI, I've just added a new morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup - https://issues.cloudera.org/browse/CDK-227

This can then be used in the MorphlineInterceptor or Morphline Sink.
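
As a rough illustration of why an in-memory lookup needs no search index, here is a minimal Python sketch. It is not the code behind the morphline command and does not read the Maxmind database format. The class name `InMemoryIpTable`, the sample ranges and the labels are all hypothetical. It assumes the ranges are non-overlapping, keeps them sorted by start address, and resolves each IP with a single binary search.

```python
import bisect
import ipaddress

# Hypothetical sample ranges (documentation-only addresses), NOT real GeoIP data.
SAMPLE_RANGES = [
    ("192.0.2.0/24", "Location A"),
    ("198.51.100.0/24", "Location B"),
    ("203.0.113.0/24", "Location C"),
]


class InMemoryIpTable:
    """Maps IP addresses to labels using sorted, non-overlapping ranges."""

    def __init__(self, cidr_labels):
        rows = []
        for cidr, label in cidr_labels:
            net = ipaddress.ip_network(cidr)
            # Store each range as (first address, last address, label) integers.
            rows.append((int(net.network_address), int(net.broadcast_address), label))
        rows.sort()
        self._rows = rows
        self._starts = [row[0] for row in rows]

    def lookup(self, ip):
        """Return the label for ip, or None if no range contains it."""
        value = int(ipaddress.ip_address(ip))
        # Index of the last range whose start address is <= value.
        i = bisect.bisect_right(self._starts, value) - 1
        if i < 0:
            return None
        _start, end, label = self._rows[i]
        return label if value <= end else None


table = InMemoryIpTable(SAMPLE_RANGES)
print(table.lookup("198.51.100.42"))  # Location B
print(table.lookup("10.0.0.1"))       # None
```

Each lookup costs one binary search over a list held in RAM, which is the kind of access pattern the Lucene discussion quoted below is weighing against an index.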


On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:

> Thank you everyone for your very constructive feedback. It was very
> helpful.
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash [3].
> I am going to spend more time to understand how the cdk morphline commands
> [4] work because I think it will really help with the transformation utils
> needed in FileSource.
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
> Regarding the GeoIPInterceptor we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume
> releases.
> This is how the Logstash project does it.
> Because of the large number of events expected, I was planning to use
> Lucene because of the speed of executing range queries from trie indexing
> [5] and the results can also be cached in-memory if they have been
> previously executed.
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push [1] events to the Redis server and the source
> will do a blocking pop [2] as it waits for new events to occur on the Redis
> Server.
> I am still trying out a few things; this part is not yet finalized.
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
> Do I have to create a GitHub repo and manage it independently, or are they
> contributed as patches to the Flume project?
> [1] http://redis.io/commands/rpush
> [2] http://redis.io/commands/blpop
> [3] http://logstash.net/docs/1.2.1/
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>> Re: GrokInterceptor
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor [1] with the grok command [2]. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines [3].
>> Re: FileSource
>> True file tailing would be great.
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
>> morphline directly into that new FileSource?
>> Re: GeoIPInterceptor
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>> Finally, a word of caution: Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
>> can do fast maxmind lookups directly without Lucene.
>> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor