Flume >> mail # dev >> New Features Proposed for Apache Flume


Re: New Features Proposed for Apache Flume
FYI, I've just added a new morphline command that returns geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup - https://issues.cloudera.org/browse/CDK-227

This can then be used in the MorphlineInterceptor or Morphline Sink.

Wolfgang.

On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:

> Thank you everyone for your very constructive feedback. It was very
> helpful.
>
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash [3].
>
> I am going to spend more time understanding how the CDK morphline commands
> [4] work because I think it will really help with the transformation utils
> needed in FileSource.
>
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
>
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
>
> Regarding the GeoIPInterceptor, we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume
> releases.
>
> This is how the Logstash project does it.
>
> Given the large number of events expected, I was planning to use
> Lucene for its fast range queries based on trie indexing [5]. The
> results of previously executed queries can also be cached in memory.
>
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
>
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
>
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push [1] events to the Redis server and the source
> will do a blocking pop [2] as it waits for new events to occur on the Redis
> Server.
>
> I am still trying out a few things, so this part is not yet finalized.
>
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
>
> Do I have to create a GitHub repo and manage it independently, or are they
> contributed as patches to the Flume project?
>
> [1] http://redis.io/commands/rpush
> [2] http://redis.io/commands/blpop
> [3] http://logstash.net/docs/1.2.1/
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
>
>
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:
>
>> Re: GrokInterceptor
>>
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor [1] with the grok command [2]. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines [3].
>>
>> Re: FileSource
>>
>> True file tailing would be great.
>>
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
>> morphline directly into that new FileSource?
>>
>> Re: GeoIPInterceptor
>>
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>>
>> Finally, a word of caution: Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. The Maxmind db fits into RAM, so Lucene seems like overkill here -
>> you can do fast Maxmind lookups directly without Lucene.
>>
>> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
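
Editor's note: the point made in the thread, that IP-to-location data small enough to fit in RAM can be queried directly without a Lucene index, can be illustrated with a minimal sketch. It assumes the data can be reduced to non-overlapping IPv4 ranges held in a sorted map. The class name InMemoryGeoLookup, its methods, and the sample ranges and labels are hypothetical; this is not the Maxmind API or file format, and not the code behind the morphline command mentioned above.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory IPv4 range lookup; not the Maxmind API or file format. */
public class InMemoryGeoLookup {

    /** One range: inclusive end address plus an illustrative location label. */
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by the inclusive start address; ranges are assumed not to overlap.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Converts a dotted-quad IPv4 string to an unsigned 32-bit value held in a long. */
    public static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Not an IPv4 address: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    public void addRange(String startIp, String endIp, String location) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), location));
    }

    /** Returns the location for the address, or null if no range contains it. */
    public String lookup(String ip) {
        long address = toLong(ip);
        // Greatest range start <= address, found in O(log n) on the sorted map.
        Map.Entry<Long, Range> candidate = ranges.floorEntry(address);
        if (candidate == null || address > candidate.getValue().end) {
            return null;
        }
        return candidate.getValue().location;
    }

    public static void main(String[] args) {
        InMemoryGeoLookup db = new InMemoryGeoLookup();
        // Documentation-only address blocks (RFC 5737) with made-up labels.
        db.addRange("192.0.2.0", "192.0.2.255", "Example City A");
        db.addRange("198.51.100.0", "198.51.100.255", "Example City B");
        System.out.println(db.lookup("192.0.2.17"));
        System.out.println(db.lookup("203.0.113.5"));
    }
}
```

Each lookup is a single floorEntry call on the sorted map, so no query parsing or index files are involved. Whether this beats a Lucene range query at high event volumes would still need the benchmark proposed earlier in the thread.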