I recommend Kafka or Flume-NG for this.

Our Analytics team runs a Kafka producer on each server to tail logs and
ship them to Kafka. We use Oozie to schedule a MapReduce consumer every
few minutes that reads all the Kafka topics into HDFS.
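The producer side is essentially a tail-and-ship loop. A minimal Python sketch, assuming a `send(line)` callable that wraps whatever Kafka client library you use (the names here are illustrative, and this ignores log rotation):

```python
import os
import time

def tail_lines(path, poll_interval=1.0, from_end=True):
    """Yield complete lines appended to a log file (tail -F style)."""
    with open(path, "r") as f:
        if from_end:
            f.seek(0, os.SEEK_END)  # ship only lines written after we start
        buf = ""
        while True:
            chunk = f.readline()
            if not chunk:
                time.sleep(poll_interval)  # nothing new yet; poll again
                continue
            buf += chunk
            if buf.endswith("\n"):  # only ship complete lines
                yield buf.rstrip("\n")
                buf = ""

def ship(lines, send):
    """Push each log line to Kafka via the supplied send callable."""
    for line in lines:
        send(line)

# e.g. ship(tail_lines("/var/log/app.log"), producer.send)
```

A real deployment would also need to handle log rotation and producer restarts (remembering the last shipped offset), which this sketch skips.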

We use Kafka as a buffer and keep a few weeks of data there. Our security
team, for example, sometimes connects and consumes some logs for various
purposes, usually when they want aggregate log data in real time.
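The buffering is just broker-side retention configuration. A sketch of the relevant server.properties setting, assuming roughly a two-week window (the exact value is illustrative, not what we necessarily run):

```
# Keep ~2 weeks of data on the brokers so downstream consumers
# (security team, ad-hoc jobs) can connect later and replay.
log.retention.hours=336
```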

Most folks access the logs in HDFS. We see less than a minute of delay for
most log lines between the server where they were written and HDFS.

On Fri, Jun 7, 2013 at 5:30 PM, Mark <[EMAIL PROTECTED]> wrote:


*Jonathan Creasy* | Sr. Ops Engineer

e: [EMAIL PROTECTED] | t: 314.580.8909
