Re: New Features Proposed for Apache Flume
> FileSource
FlumeOG had a tailsource and it was a maintenance nightmare though it
allowed tailing a full directory. How well does the commons-io class
handle file rotations, renames and such?

> GeoIPInterceptor
I think this might be too specific and seems like something that should
be separately maintained as a plugin, especially when you consider
licensing. But I've used the MaxMind DB before and, as Wolfgang said,
loading it into RAM would be better.

> RedisSink
We have a custom Redis sink and it's pretty specific to our needs. I'm
not sure how you'd set up a generic one that fills everyone's needs.

> GrokInterceptor
Not informed enough to comment.

Overall though personally I feel we shouldn't get bogged down in trying
to provide every imaginable custom component out of the main project.
FileSource is something that if done right would help a lot of people,
but I think the other 3 are niche enough that they would be better off
as plugins (maybe Redis could be included, but how would you implement it
to fit various needs?)

On 08/29/2013 01:06 AM, Israel Ekpo wrote:
> Hello everyone,
>
> I think it will be helpful to have the following features in Apache Flume:
>
> I plan to open JIRA issues for these proposals tonight.
>
> I am about to start creating patches for some of them but I would like to
> know what you guys think so that I can tweak my logic accordingly without
> going too far.
>
> *When you get a chance, please take a look at them and give me some
> feedback.*
>
> Thanks.
>
>
> *FileSource*
>
> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> specific files for events.
>
> This lets us watch files for new events as they occur, regardless of the
> operating system.
>
> It also allows us to step in and determine if two or more lines should be
> merged into one event if newline characters are present in an event.
>
> We can configure regular expressions that determine whether a specific
> line is a new event or part of the previous event.
>
> Essentially, this source will have the ability to merge multiple lines into
> one event before it is passed on to interceptors.
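
A minimal sketch of the line-merging idea described above, assuming a single regular expression that marks the first line of an event. The class name MultilineMerger and the timestamp pattern are hypothetical and are not part of Flume or Commons IO. It works on a list of lines already read, so tailing itself is out of scope here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing Flume class.
final class MultilineMerger {
    private final Pattern eventStart;

    MultilineMerger(String eventStartRegex) {
        this.eventStart = Pattern.compile(eventStartRegex);
    }

    // A line matching eventStart begins a new event; any other line is
    // appended to the event in progress, separated by a newline.
    List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current == null || eventStart.matcher(line).find()) {
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }

    public static void main(String[] args) {
        // Assumed convention: every event starts with a yyyy-MM-dd date.
        MultilineMerger merger = new MultilineMerger("^\\d{4}-\\d{2}-\\d{2} ");
        List<String> events = merger.merge(Arrays.asList(
            "2013-08-29 01:06:00 ERROR request failed",
            "java.lang.IllegalStateException: boom",
            "    at com.example.Handler.run(Handler.java:42)",
            "2013-08-29 01:06:01 INFO recovered"));
        for (String event : events) {
            System.out.println("--- event ---");
            System.out.println(event);
        }
    }
}
```

In a live tail the last event cannot be known to be complete until the next start line arrives, so a real source would probably also need a flush timeout.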
>
> It has been complicated to group multiple lines into a single event with the
> Spooling Directory Source or Exec Source. I tried creating custom
> deserializers but it was hard to get around the logic used to parse the
> files.
>
> Using the Spooling Directory also means we cannot watch the original files
> so we need a background process to copy over the log files into the
> spooling directory which requires additional setup.
>
> The tail command is also not available on all operating systems out of the
> box.
>
>
> *GrokInterceptor*
>
> With this interceptor we can parse semi-structured and unstructured text and
> log data in the headers and body of the event into something structured
> that can be easily queried.
> I plan to use the information [2] and [3] for this.
> With this interceptor, we can extract HTTP response codes, response times,
> user agents, IP addresses and a whole bunch of useful data points from free
> form text.
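
As a rough illustration of the kind of extraction described above, the sketch below uses plain java.util.regex named groups rather than a grok library. The class name AccessLogFields, the field names, and the log line layout are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not an existing Flume interceptor.
final class AccessLogFields {
    // Assumed line layout (not a standard format):
    //   192.0.2.10 "GET /index.html" 200 0.123 "ExampleAgent/1.0"
    private static final Pattern LINE = Pattern.compile(
        "^(?<clientip>\\S+) \"(?<verb>[A-Z]+) (?<request>\\S+)\" "
        + "(?<response>\\d{3}) (?<seconds>\\d+\\.\\d+) \"(?<agent>[^\"]*)\"$");

    // Group names are listed explicitly so the code does not depend on
    // newer JDK APIs for enumerating named groups.
    private static final String[] NAMES =
        {"clientip", "verb", "request", "response", "seconds", "agent"};

    // Returns the extracted fields, or an empty map if the line does not match.
    static Map<String, String> extract(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(body);
        if (m.matches()) {
            for (String name : NAMES) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "192.0.2.10 \"GET /index.html\" 200 0.123 \"ExampleAgent/1.0\"";
        System.out.println(extract(line));
    }
}
```

An interceptor built along these lines could copy the returned map into the event headers, leaving the body untouched when nothing matches.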
>
>
>
> *GeoIPInterceptor*
>
> This is for IP intelligence.
>
> This interceptor will allow us to use the value of an IP address in the
> event header or body of the request to estimate the geographical location
> of the IP address.
>
> Using the database available here [4], we can inject the two-letter code or
> country name of the IP address into the event.
>
> We can also deduce other values such as city name, postalCode, latitude,
> longitude, Internet Service Provider and Organization name.
>
> This can be very helpful in analyzing traffic patterns and target audience
> from webserver or application logs.
>
> The database is loaded into a Lucene index when the agent is started up.
> The index is created only once, if it does not already exist.
>
> As the interceptor comes across events, it maps the IP address to a variety
> of values that can be injected into the events.
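
For comparison with the Lucene index proposed above, here is a sketch of the in-RAM alternative suggested in the reply: a sorted map of non-overlapping IPv4 ranges held in memory. It does not read the MaxMind database format. The class name IpRangeTable is hypothetical, the addresses are documentation ranges, and the "AA" and "ZZ" codes are placeholders rather than real country data.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory lookup table for illustration only.
final class IpRangeTable {
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Keyed by the first address of each range; ranges must not overlap.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    // Converts dotted-quad IPv4 text to an unsigned 32-bit value held in a long.
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    void add(String first, String last, String country) {
        byStart.put(toLong(first), new Range(toLong(last), country));
    }

    // Returns the country code for the address, or null if no range covers it.
    String lookup(String ipv4) {
        long value = toLong(ipv4);
        Map.Entry<Long, Range> entry = byStart.floorEntry(value);
        if (entry != null && value <= entry.getValue().end) {
            return entry.getValue().country;
        }
        return null;
    }

    public static void main(String[] args) {
        IpRangeTable table = new IpRangeTable();
        table.add("192.0.2.0", "192.0.2.255", "AA");     // placeholder data
        table.add("203.0.113.0", "203.0.113.255", "ZZ"); // placeholder data
        System.out.println(table.lookup("192.0.2.77"));
        System.out.println(table.lookup("198.51.100.1"));
    }
}
```

Each lookup is a single floorEntry call on the sorted map, so the cost per event is logarithmic in the number of ranges and involves no disk access.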