Re: Uncaught Exception When Using Spooling Directory Source
The Spooling Directory Source is best used for sending old data / backups
through Flume, as opposed to real-time data, because (as you discovered) you
aren't supposed to write directly to a file in that directory, but rather
place closed files there. You could implement what Mike mentioned above
about rolling the logs into the spooling directory (a sketch follows), but
there are other options.
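For reference, a minimal sketch of a Spooling Directory Source feeding an
HDFS sink; the agent name "agent1", the spool path, and the HDFS location
are placeholders, not values from this thread:

agent1.sources = spool1
agent1.channels = ch1
agent1.sinks = hdfs1

# Watch a directory into which closed, immutable files are moved
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/log/flume-spool
agent1.sources.spool1.channels = ch1

# Buffer events in memory between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Archive the collected events to HDFS
agent1.sinks.hdfs1.type = hdfs
agent1.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/logs
agent1.sinks.hdfs1.channel = ch1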

If you are looking to pull data in real time, the Exec Source mentioned
above does work. The one downside is that this source is not the most
reliable, as the red warning box at that link explains, and you will have
to monitor it to make sure it hasn't crashed. However, other than the
Spooling Directory Source and any custom source you write, it is the only
other pulling source.
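For example, an Exec Source tailing a log file might look like this; the
file path is a placeholder, and the restart settings only re-run the
command if it dies rather than making the source reliable:

agent1.sources = tail1
agent1.sources.tail1.type = exec
agent1.sources.tail1.command = tail -F /var/log/nginx/access.log
# Re-run the command if it exits, waiting 10s between attempts;
# you still need external monitoring as noted above
agent1.sources.tail1.restart = true
agent1.sources.tail1.restartThrottle = 10000
agent1.sources.tail1.channels = ch1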

But depending on how your system is set up, you could set up a system for
pushing your logs into Flume. Here are some options:

If the log files you want to capture use Log4J, then there is a Log4J
Appender <http://flume.apache.org/FlumeUserGuide.html#log4j-appender> which
will send events directly to Flume. The benefit of this is that you let
Flume take control of the events right as they are generated; they are sent
via Avro to your specified host/IP, where you will have a Flume agent with
an Avro Source <http://flume.apache.org/FlumeUserGuide.html#flume-sources>
running.
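A sketch of both sides; the hostname and port are placeholders:

# Application side: log4j.properties (the Flume Log4J appender jar
# must be on the application classpath)
log4j.rootLogger = INFO, flume
log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
log4j.appender.flume.Hostname = flume-agent.example.com
log4j.appender.flume.Port = 41414

# Flume side: an Avro Source listening on the same port
agent1.sources = avro1
agent1.sources.avro1.type = avro
agent1.sources.avro1.bind = 0.0.0.0
agent1.sources.avro1.port = 41414
agent1.sources.avro1.channels = ch1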

Another alternative to the above, if you don't use Log4J but do have direct
control over the application, is to use the Embedded Flume Agent. This is
even more powerful than the Log4J appender, as you have more control over
how it works and you are able to use Flume channels with it. This would end
up pushing events via Avro to your Flume agent, to then be forwarded on.
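The embedded agent is configured programmatically with a map of properties,
roughly like the following; the keys mirror the embedded-agent section of
the Flume Developer Guide, and the hostname/port are placeholders:

# Passed to EmbeddedAgent.configure() as a Map<String, String>
channel.type = memory
channel.capacity = 10000
sinks = sink1
sink1.type = avro
sink1.hostname = flume-agent.example.com
sink1.port = 41414
processor.type = default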

There are also a variety of network methods that can communicate with
Flume. Flume has support for listening on a specified port with the Netcat
Source, for getting events via HTTP, and, if your application uses Syslog,
that's supported
<http://flume.apache.org/FlumeUserGuide.html#syslog-sources> as well.
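For instance, a Netcat Source and a Syslog TCP Source side by side; the
bind addresses and ports are placeholders:

agent1.sources = netcat1 syslog1

# Listen for newline-terminated events on a TCP port
agent1.sources.netcat1.type = netcat
agent1.sources.netcat1.bind = 0.0.0.0
agent1.sources.netcat1.port = 6666
agent1.sources.netcat1.channels = ch1

# Accept syslog messages over TCP
agent1.sources.syslog1.type = syslogtcp
agent1.sources.syslog1.host = 0.0.0.0
agent1.sources.syslog1.port = 5140
agent1.sources.syslog1.channels = ch1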

In summation, if you need to set up a pulling system, you will need to
place a Flume agent on each of your servers and have it use a Spooling
Directory or Exec Source; or, if your system is configurable enough, you
can modify it in one of the ways above to push the logs to Flume.

I hope some of that was helpful,

- Connor
On Fri, Jan 18, 2013 at 12:18 AM, Henry Ma <[EMAIL PROTECTED]> wrote:

> We have an advertisement system, which owns hundreds of servers running
> services such as Resin/Nginx, and each of them generates log files to a
> local location every second. What we need is to collect all the log files
> in time to a central storage such as MooseFS for real-time analysis, and
> then archive them to HDFS by hour.
> We want to deploy Flume to collect log files as soon as they are generated
> from nearly one hundred servers (servers may be added or removed at any
> time) to a central location, and then archive them to HDFS each hour.
> For now, the log files cannot be pushed to any collecting system; we want
> the collecting system to PULL all of them remotely.
> Can you give me some guidance? Thanks!
> On Fri, Jan 18, 2013 at 3:45 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
>> Can you provide more detail about what kinds of services?
>> If you roll the logs every 5 minutes or so, then you can configure the
>> spooling source to pick them up once they are rolled, either by rolling
>> them into a directory reserved for immutable files or by using the trunk
>> version of the spooling file source to specify a filter that ignores
>> files which don't match a "rolled" pattern.
>> You could also use exec source with "tail -F", but that is much more
>> error-prone.
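A sketch of the filter Mike describes, using the ignorePattern property of
the trunk spooling source he mentions; the ".tmp" suffix convention here is
just an example:

# Rolled files land in the spool directory without a .tmp suffix;
# the file currently being written keeps .tmp and is skipped
agent1.sources.spool1.type = spooldir
agent1.sources.spool1.spoolDir = /var/log/flume-spool
agent1.sources.spool1.ignorePattern = ^.*\.tmp$
agent1.sources.spool1.channels = ch1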