Flume, mail # user - File Channel  performance and fsync

Re: File Channel performance and fsync
Jagadish Bihani 2012-10-23, 06:40
Hi Brock

I am using flume 1.2.0.

About the batching: as per the user guide, the "exec source" does have a
batch option in 1.2.0 (parameter name: batchSize, default value: 20), and
I have tried it. Apparently it works fine. The file channel also has a
parameter "transactionCapacity", set to 1000 by default. Is that the
batch size of the file channel?
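
For reference, a sketch of how the two settings discussed above would
appear in the agent's properties file. It assumes the agent, source and
channel names from the configuration quoted at the bottom of this mail.
The values are illustrative only, and whether the exec source in 1.2.0
actually honours batchSize is the point in question in this thread.

```properties
# Assumed names: adServerAgent, avro-collection-source, fileChannel
# (taken from the configuration quoted later in this thread).

# Number of events the exec source tries to hand to the channel at once
# (illustrative value, not a recommendation)
adServerAgent.sources.avro-collection-source.batchSize = 1000

# Maximum number of events in a single file channel transaction
# (illustrative value; 1000 is the default mentioned above)
adServerAgent.channels.fileChannel.transactionCapacity = 1000
```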

Anyway, even with increased batching I couldn't get past 110-150 KB/sec
with the File Channel.
Could you please help me understand the questions I asked in the original
mail of this thread about fsync lies? With the disk which apparently
"lies" about fsync, I get 3 MB/sec in 1 flow.
I don't know whether it actually lies about fsync, but there is a
remarkable difference in fsync performance between the 2 machines, which
have almost identical hardware.


On 10/22/2012 07:59 PM, Brock Noland wrote:
> In this case, it's best to think about FileChannel as if it were a
> database. Let's pretend we are going to insert 1 million rows. If we
> committed on each row, would performance be "good"?  No, everyone
> knows that when you are inserting rows in databases, you want to batch
> 100-1000 rows into a single commit, if you want "good" performance.
> (Quoting good because it's subjective based on the scenario, but in
> this case we mean lots of MB/second).
> Part of the reason behind this logic is that when a database does a
> commit, it does an fsync operation to ensure that all data is written
> to disk and that you will not lose data due to a subsequent power loss.
> FileChannel behaves *exactly* the same. If your "batch" is only a
> single event, file channel will:
> write single event
> fsync
> write single event
> fsync
> As such, if you want "good" performance with FileChannel, you must
> increase your batch size, just like a database. If you have a
> batchSize of say 100, then FileChannel will:
> write single event 0
> write single event 1
> ...
> write single event 99
> fsync
> Which will result in much "better" performance. It's worth noting that
> ExecSource in Flume 1.2 does not have a batchSize and as such each
> event is written and then committed. ExecSource in Flume 1.3, which we
> will release soon, does have a configurable batchSize. If you want to
> try that out you can build it from the flume-1.3.0 branch.
> Brock
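
The write/fsync pattern described above can be tried outside Flume with
a small stand-alone program. This is only a sketch, not Flume's
implementation: the class name FsyncBatchDemo, the method writeEvents
and the event payload are all made up for illustration, and the
FileChannel used here is java.nio.channels.FileChannel from the JDK, not
Flume's file channel. It appends events to a temporary file and calls
force(true), the JDK counterpart of fsync, once per batch, so the two
cases (batch of 1 versus batch of 100) can be timed on a given disk.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncBatchDemo {

    /**
     * Appends eventCount fixed-size events to the given file and calls
     * force(true) once after every batchSize events (plus once for a
     * trailing partial batch). Returns the number of force() calls made.
     */
    public static int writeEvents(Path file, int eventCount, int batchSize)
            throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        // Hypothetical 24-byte payload standing in for one event.
        byte[] event = "one log line of payload\n".getBytes(StandardCharsets.UTF_8);
        int syncs = 0;
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            int pending = 0;
            for (int i = 0; i < eventCount; i++) {
                ByteBuffer buf = ByteBuffer.wrap(event);
                while (buf.hasRemaining()) {
                    ch.write(buf);          // "write single event i"
                }
                pending++;
                if (pending == batchSize) {
                    ch.force(true);         // "fsync" at the end of the batch
                    syncs++;
                    pending = 0;
                }
            }
            if (pending > 0) {              // flush a final partial batch
                ch.force(true);
                syncs++;
            }
        }
        return syncs;
    }

    public static void main(String[] args) throws IOException {
        int events = 1000;
        for (int batch : new int[] {1, 100}) {
            Path file = Files.createTempFile("fsync-demo", ".log");
            try {
                long start = System.nanoTime();
                int syncs = writeEvents(file, events, batch);
                long ms = (System.nanoTime() - start) / 1_000_000;
                System.out.printf("batchSize=%d syncs=%d elapsed=%d ms%n",
                        batch, syncs, ms);
            } finally {
                Files.deleteIfExists(file);
            }
        }
    }
}
```

The elapsed times depend entirely on the disk and its write cache, so
no expected numbers are given here. On a disk that really flushes on
every sync, the batch-of-1 run should be dominated by the sync calls,
which is the effect the database analogy points at.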
> On Mon, Oct 22, 2012 at 8:59 AM, Brock Noland <[EMAIL PROTECTED]> wrote:
>     Which version? 1.2 or trunk?
>     On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote:
>>     Hi
>>     This is the simplistic configuration with which I am getting
>>     lower performance.
>>     Even with a 2-tier architecture (cat source - avro sinks - avro
>>     source - HDFS sink)
>>     I get similar performance with the file channel.
>>     Configuration:
>>     ========
>>     adServerAgent.sources = avro-collection-source
>>     adServerAgent.channels = fileChannel
>>     adServerAgent.sinks = hdfsSink fileSink
>>     # For each one of the sources, the type is defined
>>     adServerAgent.sources.avro-collection-source.type=exec
>>     adServerAgent.sources.avro-collection-source.command = cat /home/hadoop/file.tsf
>>     # The channel can be defined as follows.
>>     adServerAgent.sources.avro-collection-source.channels = fileChannel
>>     #Define file sink
>>     adServerAgent.sinks.fileSink.type = file_roll
>>     adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
>>     adServerAgent.sinks.fileSink.channel = fileChannel
>>     adServerAgent.channels.fileChannel.type=file
>>     adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
>>     adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
>>     adServerAgent.channels.fileChannel.maxFileSize=4000000000
>>     And it is run with :
>>     JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote
>>     -XX:MaxDirectMemorySize=2g