Re: Can Flume handle +100k events per second?
Hi Bojan,

Sorry about being late in responding to this.

Your setup is of course possible, using host headers or just headers
supplied by whatever is feeding the data to Flume.

The issue is that when HDFS has to write to 120 different files, batches
get split up across the files, so the writes are not particularly efficient.

One approach is to just write everything to the same file and then
post-process it. Another is to group files to sinks: you could use an
interceptor to feed specific header(s) to a specific channel, as in the
sketch below. There are multiple strategies here, each with its own
benefits and disadvantages, but at the end of the day writing one big
HDFS file is far more efficient than writing lots of small ones.
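
A minimal sketch of this kind of routing in Flume 1.x properties form;
the agent name a1, the loggroup header, and the group-to-channel
mappings are hypothetical, and the upstream client is assumed to set
the loggroup header on each event:

    a1.sources = r1
    a1.channels = c1 c2
    a1.sinks = k1 k2

    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 41414
    a1.sources.r1.channels = c1 c2

    # multiplexing selector: each loggroup value routes to one channel,
    # so many source files funnel into a few large HDFS files
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = loggroup
    a1.sources.r1.selector.mapping.groupA = c1
    a1.sources.r1.selector.mapping.groupB = c2
    a1.sources.r1.selector.default = c1

    # alternatively, stamp the header at the agent with an interceptor
    # (the host interceptor writes the agent's own hostname/IP)
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = host
    a1.sources.r1.interceptors.i1.hostHeader = loggroup

    a1.channels.c1.type = memory
    a1.channels.c2.type = memory

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /logs/groupA
    a1.sinks.k2.type = hdfs
    a1.sinks.k2.channel = c2
    a1.sinks.k2.hdfs.path = /logs/groupB

Unmapped loggroup values fall through to the default channel, so a
handful of mappings can cover all 120 files.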

On 11/06/2013 06:39 PM, Bojan Kostić wrote:
>
> It was late when I wrote the last mail, and my explanation was not clear.
> I will illustrate:
> 20 servers, each with 60 different log files.
> I was thinking that I could have this kind of structure on HDFS:
> /logs/server0/logstat0.log
> /logs/server0/logstat1.log
> .
> .
> .
> /logs/server20/logstat0.log
> .
> .
> .
>
> But from your info I see that I can't do that.
> I could try to add a server id column to every file and then aggregate
> the files from all servers into one file:
> /logs/logstat0.log
> /logs/logstat1.log
> .
> .
> .
>
> But again I would need 60 sinks.
>
> On Nov 6, 2013 2:02 AM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:
>
>     I assume you mean you have 120 source files to be streamed into
>     HDFS.
>     There is not a 1-1 correspondence between source files and
>     destination HDFS files.  If they are on the same host, you can
>     have them all picked up through one source, one channel and one
>     HDFS sink... winding up in a single HDFS file.
>
>     If you have a config with multiple HDFS sinks (part of a
>     single agent or spanning multiple agents), you want to ensure each
>     HDFS sink writes to a separate file in HDFS.
>
>
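
A sketch of this single-pipeline layout, with hypothetical names and
paths, assuming the log files can be dropped into a spooling directory
on the agent host:

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # one spooling-directory source picks up every completed log file
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/feed
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 100000
    a1.channels.c1.transactionCapacity = 1000

    # a single sink, so all events land in one HDFS file until it rolls
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /logs
    a1.sinks.k1.hdfs.filePrefix = combined
    a1.sinks.k1.hdfs.fileType = DataStream
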
>     On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić <[EMAIL PROTECTED]> wrote:
>
>         Hello Roshan,
>
>         Thanks for the response.
>         But I am now confused: if I have 120 files, do I need to
>         configure 120 sinks/sources/channels separately? Or have I
>         missed something in the docs?
>         Maybe I should use a fan-out flow? But then again I would have
>         to set 120 params.
>
>         Best regards.
>
>         On Nov 5, 2013 8:47 PM, "Roshan Naik" <[EMAIL PROTECTED]> wrote:
>
>             Yes, to avoid them clobbering each other's writes.
>
>
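
To make the separate-files point concrete, a hypothetical fragment with
two HDFS sinks draining the same channel; distinct file prefixes keep
their output files from colliding:

    a1.sinks = k1 k2

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /logs
    a1.sinks.k1.hdfs.filePrefix = events-k1

    a1.sinks.k2.type = hdfs
    a1.sinks.k2.channel = c1
    a1.sinks.k2.hdfs.path = /logs
    a1.sinks.k2.hdfs.filePrefix = events-k2

Two sinks on one channel can also increase drain throughput, since each
sink runs on its own thread.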
>             On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić <[EMAIL PROTECTED]> wrote:
>
>                 Sorry for the late response, but I lost this email somehow.
>
>                 Thanks for the read, it is a nice start even though it
>                 is old, and the numbers are really promising.
>
>                 I'm testing the memory channel; there are about 20
>                 data sources (log servers) with 60 different files each.
>
>                 My RPC client app is basic, like in the examples, but
>                 it has load balancing across two Flume agents which are
>                 writing data to HDFS.
>
>                 I think I read somewhere that you should have one sink
>                 per file. Is that true?
>
>                 Best regards, and sorry again for late response.
>
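
For reference, the load-balancing RPC client Bojan describes is set up
through client properties (per the Flume developer guide); the host
names and ports here are placeholders:

    # passed to RpcClientFactory.getInstance(Properties)
    client.type = default_loadbalance
    hosts = h1 h2
    hosts.h1 = agent1.example.com:41414
    hosts.h2 = agent2.example.com:41414
    host-selector = round_robin
    backoff = true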
>                 On Oct 22, 2013 8:50 AM, "Juhani Connolly" <[EMAIL PROTECTED]> wrote:
>
>                     Hi Bojan,
>
>                     This is pretty old, but Mike did some testing on
>                     performance about a year and a half ago:
>
>                     https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30