Re: Uncaught Exception When Using Spooling Directory Source
+1, awesome summary Connor!

In the future, maybe someone wants to take a look at combining inotify (e.g.
https://github.com/manos/python-inotify-tail_example/blob/master/tail-F_inotify.py)
with inode information to create a reliable "tail" implementation that is
aware of file rolling and keeps track of which files have been processed (and
how much) based on their inodes. It could post to the HTTP source as an
integration point. It could be written in C/C++, or in Java using JNI; sadly,
it would likely not be portable across most OSes.

Such a thing cannot be written reliably in pure Java, because JDK6, at least,
does not expose inode information, so you end up with really nasty race
conditions. Without reliability guarantees, you may as well use tail -F.
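
For what it's worth, JDK7's NIO.2 does expose the inode on POSIX filesystems
via the "unix" attribute view. A rough, untested sketch (the path is made up,
and whether this is enough to avoid the race conditions is another question):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class InodePeek {
        public static void main(String[] args) throws Exception {
            // Made-up path; works only where the "unix" view is supported.
            Path log = Paths.get("/var/log/myapp/access.log");
            Object inode = Files.getAttribute(log, "unix:ino");
            System.out.println(log + " -> inode " + inode);
        }
    }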

Regards,
Mike

On Fri, Jan 18, 2013 at 1:13 AM, Connor Woodson <[EMAIL PROTECTED]> wrote:

> The Spooling Directory Source is best used for sending old data / backups
> through Flume, as opposed to trying to use it for realtime data (because, as
> you discovered, you aren't supposed to write directly to a file in that
> directory, but rather place already-closed files there). You could implement
> what Mike mentioned above about rolling the logs into the spooling directory
> (a rough sketch of that is below), but there are other options.
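>
> To make the "roll into the spooling directory" idea concrete, here is a
> rough, untested Java 7 sketch; the paths and naming scheme are made up, and
> the move must stay on one filesystem for ATOMIC_MOVE to work:
>
>     import java.io.IOException;
>     import java.nio.file.*;
>
>     public class SpoolRoller {
>         public static void main(String[] args) throws IOException {
>             // Hypothetical rotated (already closed) log file and spool dir.
>             Path rotated = Paths.get("/var/log/myapp/access.log.1");
>             Path spoolDir = Paths.get("/var/flume/spool");
>             // Unique name so the source never sees the same file name twice;
>             // atomic move so it never appears in the spool dir half-written.
>             Path target = spoolDir.resolve(
>                     "access." + System.currentTimeMillis() + ".log");
>             Files.move(rotated, target, StandardCopyOption.ATOMIC_MOVE);
>         }
>     }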
>
> If you are looking to pull data in real time, the Exec Source
> <http://flume.apache.org/FlumeUserGuide.html#exec-source> mentioned above
> does work. The downside is that this source is not the most reliable, as the
> red warning box at that link notes, and you will have to monitor it to make
> sure it hasn't crashed. However, other than the Spooling Directory Source and
> any custom source you write, this is the only other pulling source.
>
> Depending on how your system is set up, though, you could instead push your
> logs into Flume. Here are some options:
>
> If the application whose logs you want to capture uses Log4J, then there is
> a Log4J Appender <http://flume.apache.org/FlumeUserGuide.html#log4j-appender>
> which will send events directly to Flume. The benefit is that Flume takes
> control of the events right as they are generated; they are sent via Avro to
> your specified host/port, where you will have a Flume agent running an Avro
> Source <http://flume.apache.org/FlumeUserGuide.html#flume-sources>.
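>
> For illustration only, a rough sketch of wiring that appender up
> programmatically instead of via log4j.properties; it assumes the
> flume-ng-log4jappender jar is on the classpath and that an Avro source is
> listening on the (made-up) host/port below:
>
>     import org.apache.flume.clients.log4jappender.Log4jAppender;
>     import org.apache.log4j.Logger;
>
>     public class FlumeLog4jExample {
>         public static void main(String[] args) {
>             Log4jAppender flume = new Log4jAppender();
>             flume.setHostname("flume-collector.example.com"); // placeholder
>             flume.setPort(41414);                             // Avro source port
>             flume.activateOptions(); // standard log4j step after setting options
>             Logger.getRootLogger().addAppender(flume);
>
>             // From here on, ordinary logging calls become Flume events.
>             Logger.getLogger(FlumeLog4jExample.class).info("hello from log4j");
>         }
>     }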
>
> If you don't use Log4J but do have direct control over the application,
> another alternative is the Embedded Flume Agent
> <https://github.com/apache/flume/blob/trunk/flume-ng-doc/sphinx/FlumeDeveloperGuide.rst#embedded-agent>.
> This is even more powerful than the Log4J appender: you have more control
> over how it works, and you are able to use Flume channels with it. It ends up
> pushing events via Avro to your Flume agent to then collect/process/store.
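>
> The Developer Guide has a full example; condensed, it looks roughly like the
> sketch below (hostnames/ports are placeholders for wherever your Avro
> sources listen):
>
>     import java.nio.charset.Charset;
>     import java.util.HashMap;
>     import java.util.Map;
>     import org.apache.flume.agent.embedded.EmbeddedAgent;
>     import org.apache.flume.event.EventBuilder;
>
>     public class EmbeddedAgentExample {
>         public static void main(String[] args) {
>             Map<String, String> conf = new HashMap<String, String>();
>             conf.put("channel.type", "memory");
>             conf.put("channel.capacity", "200");
>             conf.put("sinks", "sink1 sink2");
>             conf.put("sink1.type", "avro");
>             conf.put("sink2.type", "avro");
>             conf.put("sink1.hostname", "collector1.example.com"); // placeholder
>             conf.put("sink1.port", "5564");
>             conf.put("sink2.hostname", "collector2.example.com"); // placeholder
>             conf.put("sink2.port", "5565");
>             conf.put("processor.type", "load_balance");
>
>             EmbeddedAgent agent = new EmbeddedAgent("myagent");
>             agent.configure(conf);
>             agent.start();
>             agent.put(EventBuilder.withBody("hello from the app",
>                     Charset.forName("UTF-8")));
>             agent.stop();
>         }
>     }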
>
> There are also a variety of network methods that can communicate with
> Flume. Flume supports listening on a specified port with the Netcat Source
> <http://flume.apache.org/FlumeUserGuide.html#netcat-source>, receiving events
> via HTTP POST <http://flume.apache.org/FlumeUserGuide.html#http-source>
> messages, and, if your application uses Syslog, that's supported
> <http://flume.apache.org/FlumeUserGuide.html#syslog-sources> as well.
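>
> As a quick illustration of the HTTP option, the HTTP source's default JSON
> handler accepts a JSON array of events; a rough sketch (host, port, and
> payload are made up):
>
>     import java.io.OutputStream;
>     import java.net.HttpURLConnection;
>     import java.net.URL;
>
>     public class HttpSourceExample {
>         public static void main(String[] args) throws Exception {
>             String json =
>                 "[{\"headers\":{\"host\":\"app01\"},\"body\":\"a log line\"}]";
>             URL url = new URL("http://flume-collector.example.com:5140/");
>             HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>             conn.setRequestMethod("POST");
>             conn.setRequestProperty("Content-Type", "application/json");
>             conn.setDoOutput(true);
>             OutputStream out = conn.getOutputStream();
>             out.write(json.getBytes("UTF-8"));
>             out.close();
>             System.out.println("HTTP source responded: " + conn.getResponseCode());
>             conn.disconnect();
>         }
>     }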
>
> In summary: if you need to set up a pulling system, you will need to place a
> Flume agent on each of your servers and have it use a Spooling Directory or
> Exec source; or, if your system is configurable enough, you can modify it in
> one of the ways above to push the logs to Flume.
>
> I hope some of that was helpful,
>
> - Connor
>
>
> On Fri, Jan 18, 2013 at 12:18 AM, Henry Ma <[EMAIL PROTECTED]> wrote:
>
>> We have an advertisement system with hundreds of servers running services
>> such as Resin/Nginx, and each of them generates log files in a local
>> directory every second. What we need is to collect all the log files in a
>> timely manner into central storage such as MooseFS for real-time analysis,
>> and