Flume >> mail # user >> Can Flume handle +100k events per second?


Bojan Kostić 2013-10-14, 12:00
Juhani Connolly 2013-10-22, 06:49
Bojan Kostić 2013-11-05, 12:34
Roshan Naik 2013-11-05, 19:46
Bojan Kostić 2013-11-06, 00:41
Roshan Naik 2013-11-06, 01:02
Bojan Kostić 2013-11-06, 09:39
Re: Can Flume handle +100k events per second?
A single HDFS sink can write to multiple files if the events
are annotated with a 'host' header and the %{host} escape sequence is used
in the hdfs.path config. Depending on the host name value in each event's
header, the sink will write the event to the host-specific file.
You can have all the events coming in from a single server annotated with
the hostname of that server.
I am not sure if there is a way to ensure that each file on the source ends
up as a separate file in HDFS.
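[A minimal sketch of the setup described above; the agent and component names (a1, r1, k1) and the paths are illustrative, not part of the original mail:]

```properties
# Host interceptor stamps each event with the originating hostname.
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
a1.sources.r1.interceptors.i1.useIP = false
a1.sources.r1.interceptors.i1.hostHeader = host

# One HDFS sink fanning events into per-host directories via the
# %{host} escape sequence in hdfs.path.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /logs/%{host}
a1.sinks.k1.hdfs.filePrefix = events
```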
On Wed, Nov 6, 2013 at 1:39 AM, Bojan Kostić <[EMAIL PROTECTED]> wrote:

> It was late when I wrote the last mail, and my explanation was not clear.
> I will illustrate:
> 20 servers, every one with 60 different log files.
> I was thinking that I could have this kind of structure on hdfs:
> /logs/server0/logstat0.log
> /logs/server0/logstat1.log
> .
> .
> .
> /logs/server20/logstat0.log
> .
> .
> .
>
> But from your info I see that I can't do that.
> I could try to add a server id column in every file and then aggregate the
> files from all servers into one file
> /logs/logstat0.log
> /logs/logstat1.log
> .
> .
> .
>
> But again I should have 60 sinks.
> On Nov 6, 2013 2:02 AM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:
>
>> I assume you mean  you have 120 source files to be streamed into HDFS.
>> There is not a 1-1 correspondence between source files and destination
>> hdfs files.  If they are on the same host, you can have them all picked up
>> through one source, one channel and one hdfs sink... winding up in a
>> single hdfs file.
>>
>> In case you have a config with multiple HDFS sinks (part of a single
>> agent or spanning multiple agents) you want to ensure each HDFS sink writes
>> to a separate file in HDFS.
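[As a sketch of the point above, with hypothetical names: two HDFS sinks on one agent are kept from clobbering each other by giving each a distinct hdfs.path:]

```properties
# Each sink gets its own target directory, so their files never collide.
a1.sinks = k1 k2
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /logs/flow1
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = /logs/flow2
```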
>>
>>
>> On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <[EMAIL PROTECTED]>wrote:
>>
>>> Hello Roshan,
>>>
>>> Thanks for response.
>>> But I am now confused. If I have 120 files, do I need to configure 120
>>> sinks/sources/channels separately? Or have I missed something in the docs.
>>> Maybe I should use Fan out flow? But then again I must set 120 params.
>>>
>>> Best regards.
>>> On Nov 5, 2013 8:47 PM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:
>>>
>>>> Yes, to avoid them clobbering each other's writes.
>>>>
>>>>
>>>> On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <[EMAIL PROTECTED]>wrote:
>>>>
>>>>> Sorry for late response. But I lost this email somehow.
>>>>>
>>>>> Thanks for the read, it is a nice start even though it is old.
>>>>> And the numbers are really promising.
>>>>>
>>>>> I'm testing the memory channel; there are about 20 data sources (log
>>>>> servers) with 60 different files each.
>>>>>
>>>>> My RPC client app is basic, like in the examples. But it has load
>>>>> balancing over two flume agents which are writing data to hdfs.
>>>>>
>>>>> I think I read somewhere that you should have one sink per file. Is
>>>>> that true?
>>>>>
>>>>> Best regards, and sorry again for late response.
>>>>>  On Oct 22, 2013 8:50 AM, "Juhani Connolly" <
>>>>> [EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi Bojan,
>>>>>>
>>>>>> This is pretty old, but Mike did some testing on performance about a
>>>>>> year and a half ago:
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLUME/
>>>>>> Flume+NG+Syslog+Performance+Test+2012-04-30
>>>>>>
>>>>>> He was getting a max of 70k events/sec on a single machine.
>>>>>>
>>>>>> Thing is, this is a result of a huge number of variables:
>>>>>> - Parallelization of flows allows better parallel processing
>>>>>> - Use of memory channel as opposed to a slower persistent channel.
>>>>>> - Possibly the source. I have no idea how you wrote your app
>>>>>> - Batching of events is important. Also are all events written to one
>>>>>> file? Or are they split over many? Every file is separately processed.
>>>>>> - Network congestion, your hdfs setup
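[The batching and channel points in the list above come down to a few standard knobs; the values below are illustrative, not recommendations from the thread:]

```properties
# Larger batches amortize per-transaction overhead on the HDFS sink
# (default is 100 events per batch).
a1.sinks.k1.hdfs.batchSize = 1000

# Memory channel: capacity must absorb bursts; transactionCapacity
# should be at least the sink's batch size.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 100000
a1.channels.c1.transactionCapacity = 1000
```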
>>>>>>
>>>>>> Reaching 100k events per second is definitely possible. The resources
>>>>>> you need for it will vary significantly depending on how your setup is. The

Juhani Connolly 2013-11-18, 08:50