MapReduce, mail # user - Large-scale collection of logs from multiple Hadoop nodes


Re: Large-scale collection of logs from multiple Hadoop nodes
Harsh J 2013-08-06, 06:28
Andrei's flume interceptor mention reminds me of James Kinley's Top-N
example on his flume-interceptor-analytics GH repo at
https://github.com/jrkinley/flume-interceptor-analytics#the-streaming-topn-example
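
For reference, a minimal sketch of what such an interceptor can look like,
assuming Flume 1.x's org.apache.flume.interceptor.Interceptor interface (the
class name and the header key below are illustrative placeholders, not
Kinley's actual Top-N code):

import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Illustrative interceptor: tags every event with a static header so a sink
// (or a later interceptor) can route or count on it.
public class TagInterceptor implements Interceptor {

  @Override
  public void initialize() {
    // no setup needed for this sketch
  }

  @Override
  public Event intercept(Event event) {
    // headers travel with the event all the way to the sink
    event.getHeaders().put("log-tag", "example");
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() {
    // nothing to clean up
  }

  // Flume instantiates interceptors through a Builder named in the agent config
  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new TagInterceptor();
    }

    @Override
    public void configure(Context context) {
      // read interceptor parameters from the agent config here if needed
    }
  }
}

It gets wired onto a source via the agent config, e.g.
agent.sources.r1.interceptors = i1 and agent.sources.r1.interceptors.i1.type
set to the fully qualified name of the Builder class.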

On Tue, Aug 6, 2013 at 11:41 AM, Andrei <[EMAIL PROTECTED]> wrote:
> We have similar requirements and built our log collection system around
> RSyslog and Flume. It is not in production yet, but tests so far look pretty
> good. We rejected the idea of using AMQP since it introduces a large overhead
> for log events.
>
> You can probably use Flume interceptors to do real-time processing on your
> events, though I haven't tried anything like that before. Alternatively,
> you can use Twitter Storm to handle your logs. Either way, I wouldn't
> recommend using Hadoop MapReduce for real-time processing of logs, and
> there's at least one important reason for this.
>
> As you probably know, a Flume source obtains new events and puts them into a
> channel, from which a sink then pulls them. If we are talking about the HDFS
> sink, it has a roll interval (normally time-based, but you can also roll on
> the total size of events written). If this interval is large, you won't get
> real-time processing. And if it is small, Flume will produce a large number
> of small files in HDFS, say of 10-100KB each. HDFS does not pack multiple
> files into a single block, and with the default block size of 64M each of
> these tiny files occupies its own block (and is replicated), so the NameNode
> ends up tracking an enormous number of blocks for very little data. [See the
> sink configuration sketch after this message.]
>
> Of course, you can use some ad-hoc solution like deleting small files from
> time to time or combining them into larger files, but monitoring such a
> system becomes much harder and may lead to unexpected results. So, processing
> log events before they get to HDFS seems to be the better idea.
>
>
>
> On Tue, Aug 6, 2013 at 7:54 AM, Inder Pall <[EMAIL PROTECTED]> wrote:
>>
>> We have been using a Flume-like system for such use cases at a significantly
>> large scale, and it has been working quite well.
>>
>> I would like to hear thoughts/challenges around using ZeroMQ-like systems
>> at a reasonably large scale.
>>
>> inder
>> "you are the average of 5 people you spend the most time with"
>>
>> On Aug 5, 2013 11:29 PM, "Public Network Services"
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi...
>>>
>>> I am facing a large-scale usage scenario of log collection from a Hadoop
>>> cluster and examining ways as to how it should be implemented.
>>>
>>> More specifically, imagine a cluster that has hundreds of nodes, each of
>>> which constantly produces Syslog events that need to be gathered and
>>> analyzed at another point. The total amount of logs could be tens of
>>> gigabytes per day, if not more, with a reception rate on the order of
>>> thousands of events per second, if not more.
>>>
>>> One solution is to send those events over the network (e.g., using Flume)
>>> and collect them in one or more (fewer than 5) nodes in the cluster, or in
>>> another location, where the logs will be processed either by a constantly
>>> running MapReduce job or by non-Hadoop servers running some log processing
>>> application.
>>>
>>> Another approach could be to deposit all these events into a queuing
>>> system like ActiveMQ or RabbitMQ, or whatever.
>>>
>>> In all cases, the main objective is to be able to do real-time log
>>> analysis.
>>>
>>> What would be the best way of implementing the above scenario?
>>>
>>> Thanks!
>>>
>>> PNS
>>>
>

--
Harsh J
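
Regarding the roll interval Andrei describes above: a rough sketch of the
relevant HDFS sink settings in a Flume agent properties file (the agent and
component names and the concrete values are illustrative assumptions, not a
recommendation):

agent.sinks = k1
agent.sinks.k1.channel = c1
agent.sinks.k1.type = hdfs
agent.sinks.k1.hdfs.path = hdfs://namenode/flume/logs/%Y-%m-%d
agent.sinks.k1.hdfs.fileType = DataStream
# roll (close) the current file every 5 minutes ...
agent.sinks.k1.hdfs.rollInterval = 300
# ... or once it reaches ~128MB, whichever comes first
agent.sinks.k1.hdfs.rollSize = 134217728
# disable event-count based rolling
agent.sinks.k1.hdfs.rollCount = 0

Larger rollInterval/rollSize values mean fewer, bigger files (easing the
small-files problem) at the cost of latency, which is exactly the trade-off
described above.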