Andrei's flume interceptor mention reminds me of James Kinley's Top-N
example on his flume-interceptor-analytics GH repo at
On Tue, Aug 6, 2013 at 11:41 AM, Andrei <[EMAIL PROTECTED]> wrote:
> We have similar requirements and built our log collection system around
> RSyslog and Flume. It is not in production yet, but tests so far look pretty
> good. We rejected the idea of using AMQP since it introduces a large overhead
> for log events.
> Probably you can use Flume interceptors to do real-time processing on your
> events, though I haven't tried anything like that earlier. Alternatively,
> you can use Twitter Storm to handle your logs. Anyway, I wouldn't recommend
> using Hadoop MapReduce for real-time processing of logs, and there's at
> least one important reason for this.
> As you probably know, a Flume source receives new events and puts them into a
> channel, from which a sink then pulls them. If we are talking about the HDFS
> sink, it has a roll interval (normally time-based, but you can also roll on
> file size or event count). If this interval is large, you won't get real-time
> processing. And if it is small, Flume will produce a large number of small
> files in HDFS, say, of size 10-100KB. HDFS cannot store multiple files in a
> single block, so every such file costs the NameNode its own file and block
> entries (on disk it occupies only its actual size per replica, not the full
> 64M default block), and MapReduce jobs over many tiny files are inefficient.
> Of course, you can use some ad-hoc solution like deleting small files from
> time to time or combining them into a larger file, but monitoring of such a
> system becomes much harder and may lead to unexpected results. So,
> processing log events before they get to HDFS seems to be a better idea.
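To put rough numbers on how quickly small files pile up as the roll interval shrinks, here is a minimal Python sketch. The function name, the count of 5 sinks, and the figure of about 150 bytes of NameNode heap per namespace object are illustrative assumptions (the last is a commonly quoted rule of thumb, not a measurement). It also assumes each sink rolls exactly one file per interval and that each file fits in a single block.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def small_file_estimate(num_sinks, roll_interval_s, bytes_per_object=150):
    """Estimate files written per day and the NameNode heap they pin.

    Assumes every HDFS sink rolls exactly one file per interval and that
    each file fits in a single block, so it costs two namespace objects
    (one file entry plus one block entry). bytes_per_object is a rule of
    thumb, not a measured value.
    """
    files_per_day = num_sinks * (SECONDS_PER_DAY // roll_interval_s)
    namenode_bytes = files_per_day * 2 * bytes_per_object
    return files_per_day, namenode_bytes

if __name__ == "__main__":
    # Hypothetical setup: 5 collector sinks, three candidate roll intervals.
    for interval in (10, 60, 3600):
        files, heap = small_file_estimate(num_sinks=5, roll_interval_s=interval)
        print(f"roll every {interval:>4}s: {files:>6} files/day, "
              f"~{heap / 1e6:.2f} MB of NameNode heap per day of retained logs")
```

The point of the sketch is only the scaling: file count grows inversely with the roll interval, so a short interval chosen for low latency translates directly into more namespace objects to track and compact.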
> On Tue, Aug 6, 2013 at 7:54 AM, Inder Pall <[EMAIL PROTECTED]> wrote:
>> We have been using a Flume-like system for such use cases at significantly
>> large scale and it has been working quite well.
>> Would like to hear thoughts/challenges around using ZeroMQ-like systems
>> at good enough scale.
>> On Aug 5, 2013 11:29 PM, "Public Network Services"
>> <[EMAIL PROTECTED]> wrote:
>>> I am facing a large-scale usage scenario of log collection from a Hadoop
>>> cluster and examining how it should be implemented.
>>> More specifically, imagine a cluster that has hundreds of nodes, each of
>>> which constantly produces Syslog events that need to be gathered and analyzed
>>> at another point. The total amount of logs could be tens of gigabytes per
>>> day, if not more, and the reception rate in the order of thousands of events
>>> per second, if not more.
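As a quick sanity check on those figures, a small Python sketch can convert a daily volume and an event rate into sustained bandwidth and average event size. The 50 GB/day and 5,000 events/s inputs are placeholders picked from the ranges quoted above, not measurements, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def log_stream_profile(gb_per_day, events_per_second):
    """Return (average bytes per second, average bytes per event).

    Uses decimal gigabytes (1 GB = 10**9 bytes) and assumes the load is
    spread evenly over the day; real traffic may peak above this average.
    """
    bytes_per_second = gb_per_day * 10**9 / SECONDS_PER_DAY
    bytes_per_event = bytes_per_second / events_per_second
    return bytes_per_second, bytes_per_event

if __name__ == "__main__":
    # Placeholder inputs taken from the ranges mentioned in the thread.
    bps, bpe = log_stream_profile(gb_per_day=50, events_per_second=5000)
    print(f"{bps / 1e6:.2f} MB/s sustained, ~{bpe:.0f} bytes per event")
```

Averages like these mostly help in sizing the collector tier and channel capacity; burst behaviour would need to be measured separately.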
>>> One solution is to send those events over the network (e.g., using
>>> Flume) and collect them in one or more (fewer than 5) nodes in the cluster,
>>> or in another location, where the logs will be processed either by a
>>> constantly running MapReduce job or by non-Hadoop servers running some log
>>> processing application.
>>> Another approach could be to deposit all these events into a queuing
>>> system like ActiveMQ or RabbitMQ, or whatever.
>>> In all cases, the main objective is to be able to do real-time log
>>> processing.
>>> What would be the best way of implementing the above scenario?