Thanks for the rundown, lots of detail is better than not enough!
On 09/06/2012 01:37 AM, Steve Johnson wrote:
> Hi all, I wanted to share some of my test findings/concerns, etc..
> First off, I apologize for this being so verbose, but I feel I need to
> give a little bit of a background into our setup and needs to show the
> big picture. Please ignore this if your not interested, you've been
> But if you are, great, cause I do have some valid questions to follow
> and really looking forward to any constructive comments.
> Prior to a few weeks back, I have zero experience with Flume. Have
> been familiar with it's existence for some time (about a year) but
> nothing more than that.
> My company, generates about 8billion log records per day, spread
> across 5 dataceters, with about 200 servers in each location. So
> about 1.6 billion per day in each cage. We're growing and shotting to
> increase that to about 30billion per day based on holiday traffic
> growth and our companies growth. These log records are currently
> hourly rotated logback(slf4j) generated logs from our java
> applications, containing tab delimited ascii data of various widths.
> There's probably 25 different log types we collect, but generally all
> the same format, some average record lengths of 50-60 bytes, while
> some others average 1k in width.
> Right now, we collect them using a custom built java scheduling
> application. We have a machine dedicated to this at each DC. This
> box fires off some hourly jobs (within minutes after log rotations)
> that pulls all the logs from the 200+ servers (some servers generate
> up to 10 different log types per hour), uncompressed. We used to pull
> directly to our central location, and would initiate compression on
> the servers themselves, but this generated CPU/IO spikes every hour
> that were causing performance issues. So we put a remote machine in
> each node to handle local collection. They pull all the logs files
> locally first, then compress, then move into a queue. This happens
> across all 5 dc's in parallel. We have another set of schedulers in
> our central location that then each collect from those remote nodes.
> Pull them locally, then we do some ETL work and load the raw log data
> into our Greenplum warehouse for nightly aggregations and analysis.
> This is obviously becoming very cumbersome to maintain, as we have
> right now, 10 different schedulers running over 6 locations. Also, to
> guarantee we've fetched every log file, and also to guarantee we
> haven' double-loaded any raw data (this data has only a logrec that's
> maintained globally to guarantee uniqueness, so removing dupes is a
> nightmare, so we like to avoid that), we have to track every file
> pickup, for each hop (currently tracked in a postgresql db) and then
> use that for validation and to also make sure we don't pull a rotated
> log again (logs stay archived on their original servers for 7 days).
> A couple years back when we had 1 or even 2 dc's with only about 30
> servers in each, this wasn't so bad. But as you can imaging, we're
> looking at over 80k files generated per day to track and manage. When
> things run smooth, it's great, when we have issues, it's a pain to
> dig into.
> So what are the requirements I'm looking at for a replacement of said
> 1. Less, or no custom configuration, must be drop-in and go
> environment, right now, as we add/remove servers, I have to edit a lot
> of db records to tell the schedulers which servers have which types of
> logs, I also need to replicate it out, reload the configs and make
> sure log sources have ssh keys installed, etc.
> 2. Must be able to compress data going between flume agents in remote
> DC's to the flume agents in our central location. (bandwith for this
> kind of data is not cheap, right now by gzipping the hourly logs
> locally before we transfer, we get between about 6:1 to a 10:1
To have zero data loss you must use a reliable ingest system and a
lossless channel. Netcat source can't guarrantee delivery(if a channel
can't fit the sent messages for example, they will just get dropped).
Memory channel will lose data on a crash.
This is huge. The memorychannel uses a blocking queue of events, and I'm
pretty sure that it will misbehave beyond the limits of the integer
range. Seeing as it's signed, that would be around 2 billion(and with an
average event length of say 50, that would consume at least 100gb of
ram)? FileChannel may or may not deal with huge capacities better. The
capacity designation is for event count, not bytes of data. Someone did
however recently post an issue about making physical size a setting in
some form, maybe you want to add your feedback to that (
As mentioned above, NetcatSource is not a reliable ingest system as it
doesn't know about events that weren't committed. In the long run, for
lossless, you will want to deliver data via either avro or the
scribe(thrift) data format. However if you just want to test specs, try
using ExecSource to tail the log files and fiddle about with the
In its current implementation FileChannel is lossless and thus causes a
disk flush(which generally writes two separate files) for every
commit(one for every batch of events). This is going to mean very slow
throughput if you have small batches. You can however improve this a lot
by having the channels data directories and checkpoint directories on
separate disks(not always feasible). Or you can just make sure you're
batching more events at a time.
I'm going to guess this is the (netcat)source not being able to send
messages to the channel. Since it doesn't inform the ingest system about
the failed delivery, the ingest also can't resend. I assume you're not
getting any exceptions like the other p