We have been using a Flume-like system for such use cases at a significantly
large scale, and it has been working quite well.
I would like to hear thoughts on, and challenges with, using ZeroMQ-like
systems at a reasonably large scale.
"you are the average of 5 people you spend the most time with"
On Aug 5, 2013 11:29 PM, "Public Network Services" <
[EMAIL PROTECTED]> wrote:
> I am facing a large-scale usage scenario of log collection from a Hadoop
> cluster and am examining how it should be implemented.
> More specifically, imagine a cluster that has hundreds of nodes, each of
> which constantly produces Syslog events that need to be gathered and
> analyzed at another point. The total amount of logs could be tens of
> gigabytes per day, if not more, and the reception rate in the order of
> thousands of events per second, if not more.
> One solution is to send those events over the network (e.g., using
> Flume) and collect them in one or more (fewer than 5) nodes in the cluster,
> or in another location, where the logs will be processed either by a
> constantly running MapReduce job or by non-Hadoop servers running some log
> processing application.
> Another approach could be to deposit all these events into a queuing
> system like ActiveMQ or RabbitMQ, or whatever.
> In all cases, the main objective is to be able to do real-time log
> What would be the best way of implementing the above scenario?
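
For a rough sense of the load described above, here is a back-of-envelope
sketch. It assumes 10 GB/day and 1,000 events/second, which are only the low
end of the "tens of gigabytes" and "thousands of events" figures quoted; the
function name is made up for illustration, and real traffic will be burstier
than a daily average suggests.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_rate(bytes_per_day, events_per_second):
    """Return (average bytes per second, average bytes per event)."""
    bytes_per_second = bytes_per_day / SECONDS_PER_DAY
    bytes_per_event = bytes_per_second / events_per_second
    return bytes_per_second, bytes_per_event


# Assumed inputs: 10 GB/day (decimal gigabytes) and 1,000 events/second.
bps, bpe = sustained_rate(10 * 10**9, 1000)
print(f"{bps / 1e3:.1f} kB/s sustained, ~{bpe:.0f} bytes per event")
```

Under those assumptions the average is on the order of 100 kB/s across the
whole cluster, so peak rates and burst handling probably matter more for
sizing the collectors than the daily volume does.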