Re: Can Flume handle 100k+ events per second?
It was late when I wrote my last mail, and my explanation was not clear.
I will illustrate:
20 servers, each with 60 different log files.
I was thinking that I could have this kind of structure on HDFS:

But from your info I see that I can't do that.
I could try to add a server ID column to every file and then aggregate the
files from all servers into one file.

But again, I would need 60 sinks.
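Or maybe, instead of adding a server ID column, I could stamp each event
with a host header and use it in the sink path? A rough sketch, assuming my
app sets the header itself (or that a host interceptor runs on an agent
local to each log server); the agent name, host, and paths below are made up:

  # Stamp events with the originating host and a timestamp.
  a1.sources.r1.interceptors = i1 i2
  a1.sources.r1.interceptors.i1.type = host
  a1.sources.r1.interceptors.i2.type = timestamp

  # One sink; the %{host} escape fans events out into per-server directories.
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/logs/%{host}/%Y-%m-%d

Would that let me keep a single sink per agent?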
On Nov 6, 2013 2:02 AM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:

> I assume you mean you have 120 source files to be streamed into HDFS.
> There is not a 1-1 correspondence between source files and destination
> HDFS files. If they are on the same host, you can have them all picked up
> through one source, one channel, and one HDFS sink... winding up in a
> single HDFS file.
> In case you have a config with multiple HDFS sinks (part of a single agent
> or spanning multiple agents), you want to ensure each HDFS sink writes to a
> separate file in HDFS.
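> A minimal sketch of that single source/channel/sink layout, assuming an
> agent named a1 receiving events over Avro (host, port, and path are
> hypothetical):
>
>   a1.sources = r1
>   a1.channels = c1
>   a1.sinks = k1
>
>   a1.sources.r1.type = avro
>   a1.sources.r1.bind = 0.0.0.0
>   a1.sources.r1.port = 41414
>   a1.sources.r1.channels = c1
>
>   a1.channels.c1.type = memory
>
>   a1.sinks.k1.type = hdfs
>   a1.sinks.k1.channel = c1
>   # All events wind up in one rolling file under this directory.
>   a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events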
> On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <[EMAIL PROTECTED]> wrote:
>> Hello Roshan,
>> Thanks for the response.
>> But I am now confused. If I have 120 files, do I need to configure 120
>> sources/channels/sinks separately? Or have I missed something in the docs?
>> Maybe I should use a fan-out flow? But then again I would have to set 120 parameters.
>> Best regards.
>> On Nov 5, 2013 8:47 PM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:
>>> Yes, to avoid them clobbering each other's writes.
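>>> For example, two HDFS sinks on one agent can be kept apart by giving them
>>> distinct file prefixes; a sketch with made-up agent and sink names:
>>>
>>>   a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
>>>   a1.sinks.k1.hdfs.filePrefix = sink1
>>>   a1.sinks.k2.hdfs.path = hdfs://namenode:8020/flume/events
>>>   a1.sinks.k2.hdfs.filePrefix = sink2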
>>> On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <[EMAIL PROTECTED]> wrote:
>>>> Sorry for the late response; I lost this email somehow.
>>>> Thanks for the read; it is a nice start even though it is old.
>>>> And the numbers are really promising.
>>>> I'm testing the memory channel; there are 20 data sources (log servers)
>>>> with 60 different files each.
>>>> My RPC client app is basic, like the one in the examples, but it load
>>>> balances across two Flume agents which write the data to HDFS.
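>>>> A minimal sketch of what my client does, using the Flume SDK's
>>>> load-balancing RPC client (the agent hostnames and port here are made up):
>>>>
>>>>   import java.nio.charset.Charset;
>>>>   import java.util.Properties;
>>>>   import org.apache.flume.Event;
>>>>   import org.apache.flume.EventDeliveryException;
>>>>   import org.apache.flume.api.RpcClient;
>>>>   import org.apache.flume.api.RpcClientFactory;
>>>>   import org.apache.flume.event.EventBuilder;
>>>>
>>>>   public class LoadBalancedSender {
>>>>     public static void main(String[] args) throws EventDeliveryException {
>>>>       Properties props = new Properties();
>>>>       // Load balance across the two HDFS-writing agents.
>>>>       props.put("client.type", "default_loadbalance");
>>>>       props.put("hosts", "h1 h2");
>>>>       props.put("hosts.h1", "agent1.example.com:41414");
>>>>       props.put("hosts.h2", "agent2.example.com:41414");
>>>>       props.put("host-selector", "round_robin");
>>>>       // Back off from a failed agent instead of retrying it immediately.
>>>>       props.put("backoff", "true");
>>>>
>>>>       RpcClient client = RpcClientFactory.getInstance(props);
>>>>       try {
>>>>         Event event = EventBuilder.withBody("test line", Charset.forName("UTF-8"));
>>>>         client.append(event);
>>>>       } finally {
>>>>         client.close();
>>>>       }
>>>>     }
>>>>   }
>>>>
>>>> For throughput I should probably batch with appendBatch(List<Event>)
>>>> instead of calling append() once per event.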
>>>> I think I read somewhere that you should have one sink per file. Is
>>>> that true?
>>>> Best regards, and sorry again for the late response.
>>>> On Oct 22, 2013 8:50 AM, "Juhani Connolly" <[EMAIL PROTECTED]> wrote:
>>>>> Hi Bojan,
>>>>> This is pretty old, but Mike did some testing on performance about a
>>>>> year and a half ago:
>>>>> https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30
>>>>> He was getting a max of 70k events/sec on a single machine.
>>>>> The thing is, this is the result of a huge number of variables:
>>>>> - Parallelization of flows allows better parallel processing.
>>>>> - Use of the memory channel as opposed to a slower persistent channel.
>>>>> - Possibly the source; I have no idea how you wrote your app.
>>>>> - Batching of events is important. Also, are all events written to one
>>>>> file, or are they split over many? Every file is processed separately.
>>>>> - Network congestion and your HDFS setup.
>>>>> Reaching 100k events per second is definitely possible. The resources
>>>>> you need for it will vary significantly depending on your setup. The
>>>>> more HA-type features you use, the slower delivery is likely to become. On
>>>>> the flip side, allowing fairly lax conditions that carry a small potential
>>>>> for data loss (on a crash, for example, memory channel contents are gone)
>>>>> will allow for close to 100k even on a single machine.
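>>>>> As a starting point on the batching side, these are the knobs I would
>>>>> look at; the agent/channel/sink names are placeholders and the numbers
>>>>> are illustrative, not tuned recommendations:
>>>>>
>>>>>   a1.channels.c1.type = memory
>>>>>   a1.channels.c1.capacity = 1000000
>>>>>   a1.channels.c1.transactionCapacity = 10000
>>>>>
>>>>>   # Events written to HDFS per flush.
>>>>>   a1.sinks.k1.hdfs.batchSize = 10000
>>>>>   # Roll by time only; disabling size/count rolls avoids many small files.
>>>>>   a1.sinks.k1.hdfs.rollInterval = 300
>>>>>   a1.sinks.k1.hdfs.rollSize = 0
>>>>>   a1.sinks.k1.hdfs.rollCount = 0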
>>>>> On 10/14/2013 09:00 PM, Bojan Kostić wrote:
>>>>>> Hi, this is my first post here, but I have been playing with Flume for
>>>>>> some time now.
>>>>>> My question is: how well does Flume scale?
>>>>>> Can Flume ingest 100k+ events per second? Has anyone tried something
>>>>>> like this?
>>>>>> I created a simple test and the results are really slow.
>>>>>> I wrote a simple app with an RPC client with failover using the Flume SDK.
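>>>>>> The client is configured roughly like this, a minimal sketch of the
>>>>>> SDK's failover client properties (the agent hostnames and port are made up):
>>>>>>
>>>>>>   client.type = default_failover
>>>>>>   hosts = h1 h2
>>>>>>   hosts.h1 = agent1.example.com:41414
>>>>>>   hosts.h2 = agent2.example.com:41414
>>>>>>   max-attempts = 2
>>>>>>
>>>>>> The properties get passed to RpcClientFactory.getInstance(Properties),
>>>>>> as in the load-balancing sketch above.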