Re: New Features Proposed for Apache Flume
FYI, I've just added a new morphline command that returns geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup: https://issues.cloudera.org/browse/CDK-227

This can then be used in the MorphlineInterceptor or Morphline Sink.

Wolfgang.

On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:

> Thank you everyone for your very constructive feedback. It was very
> helpful.
>
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash [3].
>
> I am going to spend more time to understand how the cdk morphline commands
> [4] work because I think it will really help with the transformation utils
> needed in FileSource.
>
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
>
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
>
> Regarding the GeoIPInterceptor we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume
> releases.
>
> This is how the Logstash project does it.
>
> Because of the large number of events expected, I was planning to use
> Lucene for the speed of its trie-indexed range queries [5]. Results of
> previously executed lookups can also be cached in memory.
>
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
>
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
>
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push [1] events to the Redis server and the source
> will do a blocking pop [2] as it waits for new events to occur on the Redis
> Server.
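
To illustrate the push / blocking-pop pattern described above, here is a minimal sketch. It does not talk to Redis: `InMemoryList` is a hypothetical in-process stand-in that only mimics the RPUSH [1] and BLPOP [2] behaviour (append to the tail, block while popping from the head), and `sink_process` and `source_poll` are illustrative names, not Flume APIs. Unlike real BLPOP, a timeout of `None` here means "wait forever".

```python
import threading
from collections import deque


class InMemoryList:
    """Hypothetical in-process stand-in for a single Redis list."""

    def __init__(self):
        self._items = deque()
        self._cond = threading.Condition()

    def rpush(self, value):
        # Like RPUSH: append to the tail and return the new list length.
        with self._cond:
            self._items.append(value)
            self._cond.notify()
            return len(self._items)

    def blpop(self, timeout=None):
        # Like BLPOP: block until an item is available, then pop the head.
        # Returns None if the timeout (in seconds) expires first.
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            return self._items.popleft()


def sink_process(queue, events):
    # Sink side: push every event onto the shared list.
    for event in events:
        queue.rpush(event)


def source_poll(queue, timeout=0.1):
    # Source side: yield events in arrival order until the list stays
    # empty for `timeout` seconds.
    while True:
        event = queue.blpop(timeout)
        if event is None:
            return
        yield event
```

Because the sink appends at the tail and the source pops from the head, events come out in the order they were pushed. A real implementation would replace `InMemoryList` with a Redis client and would have to decide how to handle connection loss and batching.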
>
> I am still trying out a few things; this part is not yet finalized.
>
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
>
> Do I have to create a GitHub repo and manage it independently or are they
> contributed as patches to the Flume project?
>
> [1] http://redis.io/commands/rpush
> [2] http://redis.io/commands/blpop
> [3] http://logstash.net/docs/1.2.1/
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
>
>
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>
>> Re: GrokInterceptor
>>
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor [1] with the grok command [2]. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines [3].
>>
>> Re: FileSource
>>
>> True file tailing would be great.
>>
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
>> morphline directly into that new FileSource?
>>
>> Re: GeoIPInterceptor
>>
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>>
>> Finally, a word of caution: Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
>> can do fast maxmind lookups directly without Lucene.
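
As a rough illustration of why a plain in-memory lookup can be enough, here is a minimal sketch that assumes the database has been reduced to sorted, non-overlapping IPv4 ranges. This is not Maxmind's file format or API; the sample ranges are reserved documentation addresses with made-up labels, and `build_index` and `lookup` are hypothetical helper names.

```python
import bisect
import ipaddress

# Hypothetical sample data: (first address, last address, label).
# These are reserved documentation ranges, not real Maxmind records.
RANGES = [
    ("192.0.2.0", "192.0.2.255", "region-a"),
    ("198.51.100.0", "198.51.100.255", "region-b"),
    ("203.0.113.0", "203.0.113.255", "region-c"),
]


def build_index(ranges):
    # Convert addresses to integers and sort by range start, so that a
    # binary search over the start values finds the candidate range.
    rows = sorted(
        (int(ipaddress.IPv4Address(lo)), int(ipaddress.IPv4Address(hi)), label)
        for lo, hi, label in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup(index, ip):
    # Find the last range whose start is <= ip, then check its upper bound.
    starts, rows = index
    value = int(ipaddress.IPv4Address(ip))
    pos = bisect.bisect_right(starts, value) - 1
    if pos < 0:
        return None
    lo, hi, label = rows[pos]
    return label if value <= hi else None


index = build_index(RANGES)
print(lookup(index, "198.51.100.42"))
print(lookup(index, "10.0.0.1"))
```

Each lookup is one binary search over a sorted list, so the cost grows logarithmically with the number of ranges and needs no external index.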
>>
>> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor