Flume, mail # user - understanding flume performance


Raymond Ng 2012-07-31, 15:19
Denny Ye 2012-07-31, 17:27
Re: understanding flume performance
Juhani Connolly 2012-08-01, 06:08
This isn't exactly an answer to your question, but try increasing the
batch sizes on both your Avro sink and your HDFS sink. Larger batches
should increase the throughput of your file channel significantly.
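
As a rough sketch of what that looks like in the agent config files: the
agent and component names (agent1, avroSink, agent2, hdfsSink) are made
up for illustration, and 1000 is only a value to start experimenting
with, not a recommendation. Keep each batch size no larger than the
transactionCapacity of the channel the sink reads from.

```properties
# Agent 1: Avro sink (hypothetical names; only the batch setting matters here)
agent1.sinks.avroSink.type = avro
# Number of events sent to agent 2 per batch
agent1.sinks.avroSink.batch-size = 1000

# Agent 2: HDFS sink (hypothetical names)
agent2.sinks.hdfsSink.type = hdfs
# Number of events written per batch before flushing to HDFS
agent2.sinks.hdfsSink.hdfs.batchSize = 1000
```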

If the memory channel on agent 1 fills up, agent 1 will be limited to
whatever throughput agent 2 has, and the logs will indicate this. You
may also want to try a snapshot of 1.3, which allows Ganglia
integration and is very useful for watching throughput, channel
capacities, and take/put attempt/success counts. If the channel does
fill up, the only thing to do is to increase throughput at agent 2,
either by increasing transaction sizes or by using separate disks for
the checkpoint and data dirs.
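
For the separate-disks suggestion, a minimal sketch of the file channel
on agent 2 might look like the following. The channel name and both
paths are hypothetical (the assumption is that /disk1 and /disk2 are
mounted on different physical disks), and the capacity figures are just
the ones from your mail.

```properties
# Agent 2: file channel (hypothetical name and paths)
agent2.channels.fileChannel.type = file
agent2.channels.fileChannel.capacity = 1000000
agent2.channels.fileChannel.transactionCapacity = 10000
# Checkpoint and data directories on different physical disks,
# so checkpoint writes do not compete with event log writes
agent2.channels.fileChannel.checkpointDir = /disk1/flume/checkpoint
agent2.channels.fileChannel.dataDirs = /disk2/flume/data
```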

On 08/01/2012 12:19 AM, Raymond Ng wrote:
> good day all, sorry for the long email
> I'd like to know how to gauge where the performance bottleneck is with
> different types of channels used
> I have a demo environment which looks a bit like this
> Setup 1
> Agent 1 ( Exec Source, Memory Channel and Avro Sink with 1 GB JVM)
> streaming data to
> Agent 2 ( Avro Source, Memory Channel and HDFS Sink with 1.5 GB JVM)
> the memory channels both have 1,000,000 capacity and 10,000 transaction
> capacity, and I managed to achieve ~8000 records/sec in the Exec Source
> of Agent 1. I'm not too concerned with how long it takes for Agent
> 2 to insert into HDFS
>
> and when I changed Agent 2 to use FileChannel
> Setup 2
> Agent 1 ( Exec Source, Memory Channel and Avro Sink with 2 GB JVM)
> streaming data to
> Agent 2 ( Avro Source, File Channel and HDFS Sink with 1.0 GB JVM),
> the File Channel has the same capacity and transaction capacity as the
> memory channel stated above
> I've doubled the JVM for Agent 1 knowing that it needs to have a
> bigger buffer to handle the same throughput from the Exec source, as
> Agent 2 will be slower buffering records to disk before writing to HDFS.
> now I achieved ~4000 records per second in the Exec source of Agent 1,
> however I wasn't expecting the Exec source to slow down on
> the throughput as it's getting the same input from tailing the same file
> Is the decrease in the source throughput in Agent 1 to do with Agent 2
> taking much longer to commit the events into the file channel, which
> has a knock-on effect on how quickly Agent 1 can release the records
> from its memory channel?
> I thought the performance of the source is determined by how quickly
> it can commit the events to the channel, so the fact that the sink can't
> consume the events as quickly as they are put in by the source should
> not affect the speed at which the source commits to the channel? I say
> this because I have come across a ChannelException which suggested
> the sinks are not keeping up with the sources, which kind of suggests
> to me that the sink will not slow down the source in terms of channel
> commit
> hope it makes sense
> thanks for any advice
> --
> Rgds
> Ray