Flume, mail # dev - New Features Proposed for Apache Flume


Re: New Features Proposed for Apache Flume
Juhani Connolly 2013-08-29, 02:26
> FileSource
FlumeOG had a tailsource and it was a maintenance nightmare though it
allowed tailing a full directory. How well does the commons-io class
handle file rotations, renames and such?

 > GeoIPInterceptor
I think this might be too specific and seems like something that should
be separately maintained as a plugin, especially when you consider
licensing. But I've used the MaxMind DB before and, like Wolfgang said,
loading it into RAM would be better.
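
To illustrate the "load it into RAM" idea, here is a minimal sketch of an in-memory IPv4 range table backed by a sorted map. It does not use the MaxMind API or file format; the class name InMemoryGeoLookup, its methods, and the sample ranges and country codes are all hypothetical, and it assumes the ranges do not overlap.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory IPv4 range table; this is NOT the MaxMind API. */
final class InMemoryGeoLookup {

    /** End of a range plus the value attached to it. */
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Keyed by the numeric start address of each (non-overlapping) range.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Converts dotted-quad IPv4 text into an unsigned 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Registers an inclusive address range and the country code it maps to. */
    void addRange(String startIp, String endIp, String country) {
        ranges.put(toLong(startIp), new Range(toLong(endIp), country));
    }

    /** Returns the country code for the address, or null if no range contains it. */
    String lookup(String ip) {
        long key = toLong(ip);
        // Find the range with the greatest start address that is <= the key.
        Map.Entry<Long, Range> entry = ranges.floorEntry(key);
        if (entry != null && key <= entry.getValue().end) {
            return entry.getValue().country;
        }
        return null;
    }
}
```

Each lookup is a single floorEntry call on the sorted map, so no disk or index access happens per event. An interceptor could build such a table once at startup and then call lookup for every event.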

 > RedisSink
We have a custom redis sink and it's pretty specific to our needs. I'm
not sure how you'd set up a generic one that fills everyone's needs.

 > GrokInterceptor
Not informed enough to comment.

Overall, though, I personally feel we shouldn't get bogged down in trying
to provide every imaginable custom component out of the main project.
FileSource is something that, if done right, would help a lot of people,
but I think the other 3 are niche enough that they would be better off
as plugins (maybe redis could be included, but how would you implement it
to fit various needs?).

On 08/29/2013 01:06 AM, Israel Ekpo wrote:
> Hello everyone,
>
> I think it will be helpful to have the following features in Apache Flume:
>
> I plan to open JIRA issues for these proposals tonight.
>
> I am about to start creating patches for some of them but I would like to
> know what you guys think so that I can tweak my logic accordingly without
> going too far.
>
> *When you get a chance, please take a look at them and give me some
> feedback.*
>
> Thanks.
>
>
> *FileSource*
>
> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> specific files for events.
>
> This allows us to, regardless of the operating system, have the ability to
> watch files for future events as they occur.
>
> It also allows us to step in and determine if two or more events should be
> merged into one event if newline characters are present in an event.
>
> We can configure certain regular expressions that determine if a specific
> line is a new event or part of the previous event.
>
> Essentially, this source will have the ability to merge multiple lines into
> one event before it is passed on to interceptors.
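
The line-merging behaviour described above could be sketched roughly as follows. This is only an illustration, not code from the proposed patch: the class name MultilineMerger and the idea of a single "start of new event" regex are assumptions, and it works on an in-memory list of lines rather than a tailed file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper that groups raw lines into multi-line events. */
final class MultilineMerger {

    // A line matching this pattern starts a new event; any other line
    // is treated as a continuation of the previous event.
    private final Pattern newEventStart;

    MultilineMerger(String newEventRegex) {
        this.newEventStart = Pattern.compile(newEventRegex);
    }

    /** Merges continuation lines into the event that precedes them. */
    List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current == null || newEventStart.matcher(line).find()) {
                // Flush the event collected so far and start a new one.
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                // Continuation line: keep the newline inside the event body.
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

With a pattern such as "^\\d{4}-\\d{2}-\\d{2} ", a log line followed by a Java stack trace would come out as one event. A source tailing a live file would also need some flush timeout, since it cannot know the last event is complete until the next matching line arrives.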
>
> It has been complicated to group multiple lines into a single event with the
> Spooling Directory Source or Exec Source. I tried creating custom
> deserializers but it was hard to get around the logic used to parse the
> files.
>
> Using the Spooling Directory also means we cannot watch the original files
> so we need a background process to copy over the log files into the
> spooling directory which requires additional setup.
>
> The tail command is also not available on all operating systems out of the
> box.
>
>
> *GrokInterceptor*
>
> With this interceptor we can parse semi-structured and unstructured text and
> log data in the headers and body of the event into something structured
> that can be easily queried.
> I plan to use the information in [2] and [3] for this.
> With this interceptor, we can extract HTTP response codes, response times,
> user agents, IP addresses and a whole bunch of useful data points from free
> form text.
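
As a rough illustration of this kind of extraction, the sketch below pulls fields out of a log line using plain java.util.regex named groups. It does not use any grok library or the pattern sets referenced in [2] and [3]; the class name AccessLogExtractor, the simplified log format, and the field names are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor for a simplified access-log line. */
final class AccessLogExtractor {

    // Assumed line layout: <client ip> "<request>" <status> <response time in ms>
    private static final Pattern LINE = Pattern.compile(
        "^(?<clientip>\\d{1,3}(?:\\.\\d{1,3}){3}) "
        + "\"(?<request>[^\"]*)\" "
        + "(?<status>\\d{3}) "
        + "(?<millis>\\d+)$");

    // Group names are listed explicitly so each one can be read back by name.
    private static final String[] FIELDS = {"clientip", "request", "status", "millis"};

    /** Returns the extracted fields, or an empty map if the line does not match. */
    static Map<String, String> extract(String body) {
        Map<String, String> headers = new LinkedHashMap<>();
        Matcher matcher = LINE.matcher(body);
        if (matcher.matches()) {
            for (String field : FIELDS) {
                headers.put(field, matcher.group(field));
            }
        }
        return headers;
    }
}
```

An interceptor built along these lines could copy the returned map into the event headers, so that downstream sinks can query on status or clientip without re-parsing the body.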
>
>
>
> *GeoIPInterceptor*
>
> This is for IP intelligence.
>
> This interceptor will allow us to use the value of an IP address in the
> event header or body of the request to estimate the geographical location
> of the IP address.
>
> Using the database available here [4], we can inject the two-letter code or
> country name of the IP address into the event.
>
> We can also deduce other values such as city name, postalCode, latitude,
> longitude, Internet Service Provider and Organization name.
>
> This can be very helpful in analyzing traffic patterns and target audience
> from webserver or application logs.
>
> The database is loaded into a Lucene index when the agent is started up.
> The index is created only once, if it does not already exist.
>
> As the interceptor comes across events, it maps the IP address to a variety
> of values that can be injected into the events.