I have a similar use case that has been in production for about a year. We have 6 web servers sending about 15GB a day of web server logs to an 11 node Hadoop cluster. Additionally those same hosts send another few GB of data from the web applications themselves to HDFS using flume. I would say in total we send about 120GB a day to HDFS from those 6 hosts. The web servers each run a local flume agent with an identical config. I will just describe our configuration as I think it answers most of your questions.
The web logs are IIS log files that roll once a day and are written to continuously. We wanted a relatively low latency from log file to HDFS so we use a scheduled task to create an incremental 'diff' file every minute and place it in a spool directory. From there a spoolDir source on the local agent processes the file. The same task that creates the diff files also cleans up ones that have been processed by flume. This spoolDir source is just for weblogs and has no special routing, it uses a file channel and Avro sinks to connect to HDFS. There are two sinks in a failover configuration using a sink group that send to flume agents co-located on HDFS data nodes.
The application logs are JSON data that is sent directly from the application to flume via an httpSource. Again, the logs are sent to the same local agent. We have about a dozen separate log streams in total from our application but we send all these events to the one httpSource. Using headers we split into 'high' and 'low' priority streams using a multiplexing channel selector. We don't really do any special handling of the high priority events, we just watch one of them a LOT closer than the other. ;) These channels are also drained by an Avro sink and are sent to a different pair of flume agents, again co-located with data nodes.
The flume agents that run on the data nodes just have avro sources, file channels and HDFS sinks. We have found two HDFS sinks (without any sink groups) can be necessary to keep the channels from backing up in some cases. The amount of file channels on these agents varies, some log streams are split into their own channels first and many of them use a 'catch all' channel that uses tokenized paths on the sink to write the data to different locations in HDFS according to header values. The HDFS sinks all write DataStream files with format Text and bucket into Hive friendly partitioned directories by date and hour. The JSON events are one line each and we use a custom Hive SerDe for the JSON data. We use Hive external table definitions to read the data and use Oozie to process every log stream hourly into Snappy compressed internal Hive tables and then drop the raw data in 8 days. We don't use Impala all that much as most of our workflows just crunch the data and push it back into a SQL DW for the data people.
Here is a good start for capacity planning: https://cwiki.apache.org/confluence/display/FLUME/Flume%27s+Memory+Consumption.
We have gotten away with default channel sizes (1 million) so far without issue. We do try to separate the file channels to different physical disks as much as we can to optimize our hardware.
Hope that helps,
From: Asim Zafir [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 05, 2014 1:23 PM
To: [EMAIL PROTECTED]
Subject: distributed weblogs ingestion on HDFS via flume
Here is the problem statement, will be very much interested to have your valuable input and feedback on the following:
Assuming that fact that we generate 200GB of logs PER DAY from 50 webservers
Goal is to sync that to HDFS repository
1) do all the webserver in our case needs to run a flume agent?
2) do all the webserver will be acting as source in our setup ?
3) can we sync webservers logs directly to HDFS store by passing channels?
4) do we have a choice of directly synching the weblogs to HDFS store and not let the webserver right locally? what is the best practice?
5) what setup will that be where i would let the flume, sync a local datadire on weblogs, and sync it as soon as the date arrives to this directory?
6) do i need a dedicated flume server for this setup?
7) if i do use memory based channel and then do HDFS sync do I need a dedicated server, or can run those agents on the webserver itself, provided there is enough memory OR would it be recommended to position my config to a centralize flume server and the establish the sync.
8) how should we do the capacity planning for a memory based channel?
9) how should we do the capacity planning for a file based channel ?