

Re: Syslog Infrastructure with Flume
Hi Ron,

Yep -- looks like we'll be storing logs twice, for the time being. But
we're *so* close to not having to!

On 10/26/2012 11:06 PM, Ron Thielen wrote:
> I am exactly where you are with this, except that I have not yet had
> time to write a serializer to address the Hostname / Timestamp issue.
> Questions about the use of Flume in this manner seem to recur on a
> regular basis, so it appears to be a common use case.
> Sorry I cannot offer a solution since I am in your shoes at the
> moment, unfortunately looking at storing logs twice.
> Ron Thielen
> Ronald J Thielen
> *From:*Josh West [mailto:[EMAIL PROTECTED]]
> *Sent:* Friday, October 26, 2012 9:05 AM
> *Subject:* Syslog Infrastructure with Flume
> Hey folks,
> I've been experimenting with Flume for a few weeks now, trying to
> determine an approach to designing a reliable, highly available,
> scalable system to store logs from various sources, including syslog.  
> Ideally, this system will meet the following requirements:
>  1. Logs from syslog across all servers make their way into HDFS.
>  2. Logs are stored in HDFS in a manner that is available for
>     post-processing:
>       * Example:  HIVE partitions - with HDFS Flume Sink, can set
>         hdfs.path to
>         hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
>       * Example:  Custom map reduce jobs...
>  3. Logs are stored in HDFS in a manner that is available for
>     "reading" by sysadmins:
>       * During troubleshooting/firefighting, it is quite helpful to be
>         able to login to a central logging system and tail -f / grep logs.
>       * We need to be able to see the logs "live".
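
To make requirement 1 concrete, a minimal agent configuration along these lines might look as follows. This is only a sketch: the agent and component names are hypothetical, and the hdfs.path simply reuses the partitioning scheme from the example above.

```properties
# Hypothetical agent: a syslog TCP source feeding HDFS through a file channel.
agent.sources = syslog-src
agent.channels = file-ch
agent.sinks = hdfs-sink

agent.sources.syslog-src.type = syslogtcp
agent.sources.syslog-src.port = 5140
agent.sources.syslog-src.channels = file-ch

# Durable channel so events survive an agent restart.
agent.channels.file-ch.type = file

agent.sinks.hdfs-sink.type = hdfs
agent.sinks.hdfs-sink.channel = file-ch
# Partition by originating host and syslog facility, as discussed above.
agent.sinks.hdfs-sink.hdfs.path = hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
# Plain text output rather than the default SequenceFile.
agent.sinks.hdfs-sink.hdfs.fileType = DataStream
```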
> Some folks may be wondering why we are choosing Flume for syslog
> instead of something like Graylog2 or Logstash. The answer is that we will
> be using Flume + Hadoop for the transport and processing of other
> types of data in addition to syslog. For example, webserver access
> logs for post processing and statistical analysis.  So, we would like
> to make the most use of the Hadoop cluster, keeping all logs of all
> types in one redundant/scalable solution.  Additionally, by keeping
> both syslog and webserver access logs in Hadoop/HDFS, we can begin to
> correlate events.
> I've run into some snags while attempting to implement Flume in a
> manner that satisfies the requirements listed in the top of this message:
>  1. Logs to HDFS:
>       * I can indeed use the Flume HDFS Sink to reliably write logs
>         into HDFS.
>       * Needed to write custom serializer to add Hostname and
>         Timestamp fields back to syslog messages.
>       * See: https://issues.apache.org/jira/browse/FLUME-1666
>         <https://issues.apache.org/jira/browse/FLUME-1666>
>  2. Logs to HDFS in manner available for
>     reading/firefighting/troubleshooting by sysadmins:
>       * Flume HDFS Sink uses the BucketWriter for recording flume
>         events to HDFS.
>       * Creates data files like:
>         /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
>       * Each file is format of FlumeData (or custom prefix) followed
>         by . followed by unix timestamp of when the file was created.
>           o This is somewhat necessary... As you could have multiple
>             Flume writers, writing to the same HDFS, the files cannot
>             be opened by more than one writer.  So each writer should
>             write to its own file.
>       * Latest file, currently being written to, is suffixed with ".tmp".
>       * This approach is not very sysadmin-friendly....
>           o You have to find the latest (ie. the .tmp files) and
>             hadoop fs -tail -f /path/to/file.tmp
>           o Hadoop's fs -tail -f command first prints the entire
>             file's contents, then begins tailing.
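
The serializer mentioned in point 1 (FLUME-1666) essentially re-prepends header fields to the event body before it is written out. Below is a simplified, self-contained sketch of just that reconstruction logic, written outside the actual Flume EventSerializer interface; it assumes the "timestamp" and "host" headers that the Flume syslog sources populate, and formats the timestamp in UTC.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper: rebuild a syslog-style line from Flume event
// headers ("timestamp" in epoch millis, "host") plus the event body,
// which the syslog sources strip down to the bare message.
public class SyslogReassembler {
    private static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("MMM d HH:mm:ss", Locale.US)
                         .withZone(ZoneOffset.UTC);

    public static String reassemble(Map<String, String> headers, byte[] body) {
        String ts = FMT.format(Instant.ofEpochMilli(
            Long.parseLong(headers.getOrDefault("timestamp", "0"))));
        String host = headers.getOrDefault("host", "unknown");
        return ts + " " + host + " " + new String(body);
    }
}
```

In a real implementation this logic would live in the write(Event) method of an org.apache.flume.serialization.EventSerializer, registered through its Builder and referenced from the HDFS sink's serializer property.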
> So the sum of it all is: Flume is awesome for getting syslog (and
> other) data into HDFS for post-processing, but not the best at giving
> sysadmins a live, tail -f style view of those logs.

Josh West
Lead Systems Administrator