Flume, mail # user - Uncaught Exception When Using Spooling Directory Source


Henry Ma 2013-01-18, 04:24
Brock Noland 2013-01-18, 04:39
Henry Ma 2013-01-18, 05:22
Brock Noland 2013-01-18, 05:31
Patrick Wendell 2013-01-18, 05:48
Henry Ma 2013-01-18, 05:59
Mike Percy 2013-01-18, 06:05
Henry Ma 2013-01-18, 06:23
Mike Percy 2013-01-18, 07:45
Henry Ma 2013-01-18, 08:05
Henry Ma 2013-01-18, 08:18
Connor Woodson 2013-01-18, 09:13
Mike Percy 2013-01-18, 09:26
Re: Uncaught Exception When Using Spooling Directory Source
Henry Ma 2013-01-18, 09:32
Thank you very much, Connor!! It is really HELPFUL.
On Fri, Jan 18, 2013 at 5:13 PM, Connor Woodson <[EMAIL PROTECTED]> wrote:

> The Spooling Directory Source is best used for sending old data / backups
> through Flume, as opposed to realtime data (because, as you discovered, you
> aren't supposed to write directly to a file in that directory, but rather
> place already-closed files there). You could implement what Mike mentioned
> above about rolling the logs into the spooling directory, but there are
> other options.
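A minimal sketch of an agent configuration for the spooling-directory approach (the agent name, spool path, and placeholder logger sink below are illustrative, not taken from the thread):

    # Illustrative spooling directory source; rolled/closed files are dropped
    # into the spool directory and picked up from there.
    agent1.sources = spool-src
    agent1.channels = mem-ch
    agent1.sinks = log-sink

    agent1.sources.spool-src.type = spooldir
    agent1.sources.spool-src.spoolDir = /var/log/flume-spool
    agent1.sources.spool-src.channels = mem-ch

    agent1.channels.mem-ch.type = memory
    agent1.channels.mem-ch.capacity = 10000

    # Placeholder sink; in practice this would be an Avro or HDFS sink.
    agent1.sinks.log-sink.type = logger
    agent1.sinks.log-sink.channel = mem-ch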
>
> If you are looking to pull data in real time, the Exec Source
> <http://flume.apache.org/FlumeUserGuide.html#exec-source> mentioned above
> does work. The one downside is that this source is not the most reliable,
> as mentioned in the red box at that link, and you will have to monitor it
> to make sure it hasn't crashed. However, other than the Spooling Directory
> Source and any custom source you write, this is the only other pulling
> source.
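A rough sketch of an Exec Source pulling a file in real time (the tail command, file path, and names are assumptions for illustration):

    # Illustrative exec source tailing a log file; it cannot guarantee
    # delivery if the agent or channel fails, which is the caveat above.
    agent1.sources = tail-src
    agent1.channels = mem-ch

    agent1.sources.tail-src.type = exec
    agent1.sources.tail-src.command = tail -F /var/log/nginx/access.log
    agent1.sources.tail-src.channels = mem-ch

    agent1.channels.mem-ch.type = memory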
>
> But depending on how your system is set up, you may instead be able to
> push your logs into Flume. Here are some options:
>
> If the log files you want to capture use Log4J, then there is a Log4JAppender
> <http://flume.apache.org/FlumeUserGuide.html#log4j-appender> which will
> send events directly to Flume. The benefit to this is that you let Flume
> take control of the events right as they are generated; they are sent
> through Avro to your specified host/ip, where you will have a Flume agent
> with an Avro Source
> <http://flume.apache.org/FlumeUserGuide.html#flume-sources> running.
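As a hedged sketch (the hostname, port, and agent/source names are placeholders), the application side would point its log4j configuration at the Flume appender, and the Flume agent would listen with an Avro Source:

    # log4j.properties on the application side
    log4j.rootLogger = INFO, flume
    log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
    log4j.appender.flume.Hostname = collector.example.com
    log4j.appender.flume.Port = 41414

    # Matching Avro Source on the receiving Flume agent
    agent1.sources.avro-src.type = avro
    agent1.sources.avro-src.bind = 0.0.0.0
    agent1.sources.avro-src.port = 41414
    agent1.sources.avro-src.channels = mem-ch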
>
> Another alternative to the above if you don't use Log4J but you do have
> direct control over the application is to use the Embedded Flume Agent<https://github.com/apache/flume/blob/trunk/flume-ng-doc/sphinx/FlumeDeveloperGuide.rst#embedded-agent>.
> This is even more powerful than the log4j appender as you have more control
> over how it works and you are able to use the Flume channels with it. This
> would end up pushing events via Avro to your Flume agent to then
> collect/process/store.
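For reference, the embedded agent is configured in Java through a Map of properties rather than a config file; a sketch of the kinds of keys involved (values below are placeholders) looks like:

    # Passed as a Map<String, String> to the EmbeddedAgent in your application
    channel.type = memory
    channel.capacity = 10000
    sinks = sink1
    sink1.type = avro
    sink1.hostname = collector.example.com
    sink1.port = 41414
    processor.type = default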
>
> There are a variety of network methods that can communicate with Flume.
> Flume has support for listening on a specified port with the Netcat Source
> <http://flume.apache.org/FlumeUserGuide.html#netcat-source>, getting events
> via HTTP POST <http://flume.apache.org/FlumeUserGuide.html#http-source>
> messages, and if your application uses Syslog that's supported
> <http://flume.apache.org/FlumeUserGuide.html#syslog-sources> as well.
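For example (the ports and source names are assumptions), an HTTP Source accepting JSON-encoded events and a Netcat Source listening on a TCP port could be configured like this:

    # Illustrative HTTP source; the default handler expects JSON event arrays.
    agent1.sources.http-src.type = http
    agent1.sources.http-src.port = 8080
    agent1.sources.http-src.channels = mem-ch

    # Illustrative netcat source; each received line becomes an event.
    agent1.sources.nc-src.type = netcat
    agent1.sources.nc-src.bind = 0.0.0.0
    agent1.sources.nc-src.port = 44444
    agent1.sources.nc-src.channels = mem-ch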
>
> In summary: if you need a pulling system, you will need to place a Flume
> agent on each of your servers and have it use a Spooling Directory or Exec
> Source; or, if your system is configurable enough, you can modify it in one
> of the ways above to push the logs to Flume.
>
> I hope some of that was helpful,
>
> - Connor
>
>
> On Fri, Jan 18, 2013 at 12:18 AM, Henry Ma <[EMAIL PROTECTED]> wrote:
>
>> We have an advertisement system, which owns hundreds of servers running
>> services such as resin/nginx, and each of them writes log files to a
>> local location every second. What we need is to collect all the log files
>> promptly into central storage such as MooseFS for real-time analysis, and
>> then archive them to HDFS every hour.
>>
>> We want to deploy Flume to collect log files, as soon as they are
>> generated, from nearly one hundred servers (servers may be added or
>> removed at any time) to a central location, and then archive them to HDFS
>> each hour.
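A rough sketch of the hourly-archive step on a central collector (the path and names are placeholders; events need a timestamp header, e.g. added by the timestamp interceptor, for the time escapes to resolve):

    # Illustrative HDFS sink bucketing output by hour
    collector.sinks.hdfs-sink.type = hdfs
    collector.sinks.hdfs-sink.channel = mem-ch
    collector.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/logs/%Y-%m-%d/%H
    collector.sinks.hdfs-sink.hdfs.rollInterval = 3600
    collector.sinks.hdfs-sink.hdfs.fileType = DataStream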
>>
>> For now, the log files cannot be pushed to any collecting system. We want
>> the collecting system to PULL all of them remotely.
>>
>> Can you give me some guide? Thanks!
>>
>>
>> On Fri, Jan 18, 2013 at 3:45 PM, Mike Percy <[EMAIL PROTECTED]> wrote:
>>
>>> Can you provide more detail about what kinds of services?
>>>
>>> If you roll the logs every 5 minutes or so then you can configure the
>>> spooling source to pick them up once they are rolled by either rolling them
Best Regards,
马环宇
网易有道 EAD-Platform
POPO:   [EMAIL PROTECTED]
MSN:    [EMAIL PROTECTED]
MOBILE: 18600601996