Re: Syslog Infrastructure with Flume
Hi Ron,

Yep -- looks like we'll be storing logs twice, for the time being. But
we're *so* close to not having to!

On 10/26/2012 11:06 PM, Ron Thielen wrote:
>
> I am exactly where you are with this, except that I have not yet had
> time to write a serializer to address the Hostname/Timestamp issue.
> Questions about using Flume in this manner seem to recur on a regular
> basis, so it seems to be a common use case.
>
> Sorry I cannot offer a solution since I am in your shoes at the
> moment, unfortunately looking at storing logs twice.
>
> Ron Thielen
>
> Ronald J Thielen
>
> *From:* Josh West [mailto:[EMAIL PROTECTED]]
> *Sent:* Friday, October 26, 2012 9:05 AM
> *To:* [EMAIL PROTECTED]
> *Subject:* Syslog Infrastructure with Flume
>
> Hey folks,
>
> I've been experimenting with Flume for a few weeks now, trying to
> determine an approach to designing a reliable, highly available,
> scalable system to store logs from various sources, including syslog.  
> Ideally, this system will meet the following requirements:
>
>  1. Logs from syslog across all servers make their way into HDFS.
>  2. Logs are stored in HDFS in a manner that is available for
>     post-processing:
>       * Example:  HIVE partitions - with the HDFS Flume Sink, we can
>         set hdfs.path to
>         hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
>         (a config sketch follows this list).
>       * Example:  Custom map reduce jobs...
>  3. Logs are stored in HDFS in a manner that is available for
>     "reading" by sysadmins:
>       * During troubleshooting/firefighting, it is quite helpful to be
>         able to log in to a central logging system and tail -f / grep logs.
>       * We need to be able to see the logs "live".
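>
> For concreteness, here is roughly the sink configuration I have in
> mind for requirement 2.  This is an untested sketch -- the agent,
> channel, and sink names are made up, and it assumes a "host" header
> is present on each event (e.g. set by the source or an interceptor):
>
>     agent1.sources  = syslog-in
>     agent1.channels = ch1
>     agent1.sinks    = hdfs-out
>
>     # Syslog over TCP; rsyslog/syslog-ng on each server forwards here.
>     agent1.sources.syslog-in.type = syslogtcp
>     agent1.sources.syslog-in.port = 5140
>     agent1.sources.syslog-in.host = 0.0.0.0
>     agent1.sources.syslog-in.channels = ch1
>
>     # File channel for durability across agent restarts.
>     agent1.channels.ch1.type = file
>
>     # Bucket by host and syslog facility, so each directory doubles
>     # as a HIVE partition.
>     agent1.sinks.hdfs-out.type = hdfs
>     agent1.sinks.hdfs-out.channel = ch1
>     agent1.sinks.hdfs-out.hdfs.path = hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
>     agent1.sinks.hdfs-out.hdfs.fileType = DataStream
>     agent1.sinks.hdfs-out.hdfs.writeFormat = Text
>     # A literal per-agent prefix keeps concurrent writers in
>     # separate files.
>     agent1.sinks.hdfs-out.hdfs.filePrefix = agent1
>     agent1.sinks.hdfs-out.hdfs.rollInterval = 300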
>
> Some folks may be wondering why we are choosing Flume for syslog
> instead of something like Graylog2 or Logstash.  The answer is we will
> be using Flume + Hadoop for the transport and processing of other
> types of data in addition to syslog. For example, webserver access
> logs for post processing and statistical analysis.  So, we would like
> to make the most use of the Hadoop cluster, keeping all logs of all
> types in one redundant/scalable solution.  Additionally, by keeping
> both syslog and webserver access logs in Hadoop/HDFS, we can begin to
> correlate events.
>
> I've run into some snags while attempting to implement Flume in a
> manner that satisfies the requirements listed at the top of this message:
>
>  1. Logs to HDFS:
>       * I can indeed use the Flume HDFS Sink to reliably write logs
>         into HDFS.
>       * Needed to write a custom serializer to add the Hostname and
>         Timestamp fields back to syslog messages (a sketch follows
>         this list).
>       * See: https://issues.apache.org/jira/browse/FLUME-1666
>         <https://issues.apache.org/jira/browse/FLUME-1666>
>  2. Logs to HDFS in manner available for
>     reading/firefighting/troubleshooting by sysadmins:
>       * The Flume HDFS Sink uses the BucketWriter for recording Flume
>         events to HDFS.
>       * Creates data files like:
>         /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
>       * Each file name is FlumeData (or a custom prefix) followed by
>         "." and the Unix timestamp of when the file was created.
>           o This is somewhat necessary... with multiple Flume writers
>             writing to the same HDFS, a file cannot be opened by more
>             than one writer, so each writer should write to its own
>             file.
>       * Latest file, currently being written to, is suffixed with ".tmp".
>       * This approach is not very sysadmin-friendly...
>           o You have to find the latest files (i.e. the .tmp files) and
>             hadoop fs -tail -f /path/to/file.tmp
>           o Hadoop's fs -tail -f command first prints the entire
>             file's contents, then begins tailing.
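>
> For reference, here is a minimal sketch of the kind of serializer I
> mean for snag 1 (see FLUME-1666 above).  The class, package, and
> header names here are mine, and it assumes the "timestamp" and "host"
> headers are populated by the source and/or interceptors -- it simply
> prepends them to each record before the body:
>
>     package com.example.flume;  // illustrative package
>
>     import java.io.IOException;
>     import java.io.OutputStream;
>
>     import org.apache.flume.Context;
>     import org.apache.flume.Event;
>     import org.apache.flume.serialization.EventSerializer;
>
>     public class SyslogRestoringSerializer implements EventSerializer {
>
>       private final OutputStream out;
>
>       private SyslogRestoringSerializer(Context ctx, OutputStream out) {
>         this.out = out;
>       }
>
>       @Override
>       public void write(Event event) throws IOException {
>         // Re-attach the fields the syslog source stripped from the body
>         // (assumes these headers exist; adjust to your pipeline).
>         String ts = event.getHeaders().get("timestamp");
>         String host = event.getHeaders().get("host");
>         StringBuilder prefix = new StringBuilder();
>         if (ts != null) prefix.append(ts).append(' ');
>         if (host != null) prefix.append(host).append(' ');
>         out.write(prefix.toString().getBytes("UTF-8"));
>         out.write(event.getBody());
>         out.write('\n');
>       }
>
>       @Override public void afterCreate() { /* no file header needed */ }
>       @Override public void afterReopen() { /* nothing to re-init */ }
>       @Override public void flush() throws IOException { out.flush(); }
>       @Override public void beforeClose() { /* nothing to finalize */ }
>       @Override public boolean supportsReopen() { return true; }
>
>       public static class Builder implements EventSerializer.Builder {
>         @Override
>         public EventSerializer build(Context context, OutputStream out) {
>           return new SyslogRestoringSerializer(context, out);
>         }
>       }
>     }
>
> It would be wired into the sink with something like:
>
>     agent1.sinks.hdfs-out.serializer = com.example.flume.SyslogRestoringSerializer$Builder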
>
> So the sum of it all is Flume is awesome for getting syslog (and
> other) data into HDFS for post processing, but not the best at getting
> those logs back out for live reading by sysadmins.

Josh West
Lead Systems Administrator
One.com, [EMAIL PROTECTED]