Flume >> mail # user >> File Channel  performance and fsync


Re: File Channel performance and fsync
In this case, it's best to think of FileChannel as if it were a database.
Let's pretend we are going to insert 1 million rows. If we committed on
each row, would performance be "good"?  No, everyone knows that when you
are inserting rows in databases, you want to batch 100-1000 rows into a
single commit, if you want "good" performance. (Quoting good because it's
subjective based on the scenario, but in this case we mean lots of
MB/second).

Part of the reason behind this logic is that when a database does a commit,
it does an fsync operation to ensure that all data is written to disk and
that you will not lose data due to a subsequent power loss.

FileChannel behaves *exactly* the same. If your "batch" is only a single
event, file channel will:

write single event
fsync
write single event
fsync

As such, if you want "good" performance with FileChannel, you must increase
your batch size, just like a database. If you have a batchSize of say 100,
then FileChannel will:

write single event 0
write single event 1
...
write single event 99
fsync

Which will result in much "better" performance. It's worth noting that
ExecSource in Flume 1.2 does not have a batchSize, so each event is
written and then committed. ExecSource in Flume 1.3, which we will
release soon, does have a configurable batchSize. If you want to try that
out you can build it from the flume-1.3.0 branch.
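In 1.3-style properties that would look something like the fragment below (a sketch; the agent, source, and sink names here are hypothetical, and the ExecSource batchSize property only exists from Flume 1.3 on):

```properties
# Hypothetical agent "a1": commit events to the channel in batches of 100
# rather than one at a time (ExecSource batchSize is new in Flume 1.3).
a1.sources.s1.type = exec
a1.sources.s1.command = cat /home/hadoop/file.tsf
a1.sources.s1.batchSize = 100

# The HDFS sink batches on the take side of the channel as well.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.batchSize = 100
```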

Brock

On Mon, Oct 22, 2012 at 8:59 AM, Brock Noland <[EMAIL PROTECTED]> wrote:

>  Which version? 1.2 or trunk?
>
> On Monday, October 22, 2012 at 8:18 AM, Jagadish Bihani wrote:
>
>  Hi
>
> This is the simplistic configuration with which I am getting lower
> performance.
> Even with a 2-tier architecture (cat source - avro sink - avro source -
> HDFS sink) I get similar performance with the file channel.
>
> Configuration:
> =========
> adServerAgent.sources = avro-collection-source
> adServerAgent.channels = fileChannel
> adServerAgent.sinks = hdfsSink fileSink
>
> # For each one of the sources, the type is defined
> adServerAgent.sources.avro-collection-source.type=exec
> adServerAgent.sources.avro-collection-source.command= cat
> /home/hadoop/file.tsf
>
> # The channel can be defined as follows.
> adServerAgent.sources.avro-collection-source.channels = fileChannel
>
> #Define file sink
> adServerAgent.sinks.fileSink.type = file_roll
> adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
> adServerAgent.sinks.fileSink.channel = fileChannel
> adServerAgent.channels.fileChannel.type=file
>
> adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
>
> adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
> adServerAgent.channels.fileChannel.maxFileSize=4000000000
>
> And it is run with:
> JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote
> -XX:MaxDirectMemorySize=2g
>
> Regards,
> Jagadish
>
> On 10/22/2012 05:42 PM, Brock Noland wrote:
>
> Hi,
>
>  I'll respond in more depth later, but it would help if you posted your
> configuration file and the version of flume you are using.
>
>  Brock
>
>  On Mon, Oct 22, 2012 at 6:48 AM, Jagadish Bihani <
> [EMAIL PROTECTED]> wrote:
>
>  Hi
>
> I am writing this on top of another thread where there was a discussion
> on "fsync lies" and on the fact that only the file channel uses fsync,
> not the file sink:
>
> -- I tested the fsync performance on 2 machines (on one machine I was
> getting very good throughput using the file channel, and on the other it
> was almost 100 times slower, with almost the same hardware configuration)
> using the following code:
>
>
> #include <fcntl.h>
> #include <stdio.h>
> #include <unistd.h>
>
> #define PAGESIZE 4096
>
> int main(int argc, char *argv[])
> {
>         char my_write_str[PAGESIZE];
>         char my_read_str[PAGESIZE];
>         char *read_filename= argv[1];
>         int readfd,writefd;
>
>         readfd = open(read_filename,O_RDONLY);
>         /* mode was "777" in the original; an octal mode like 0644 was meant */
>         writefd = open("written_file",O_WRONLY|O_CREAT,0644);
>         int len=lseek(readfd,0,SEEK_END);
>         lseek(readfd,0,SEEK_SET);
>         int iterations = len/PAGESIZE;
>         /* the archived message is cut off here; the loop below is
>            reconstructed from context: copy page by page, fsync per write */
>         for (int i = 0; i < iterations; i++) {
>                 read(readfd, my_read_str, PAGESIZE);
>                 write(writefd, my_read_str, PAGESIZE);
>                 fsync(writefd);
>         }
>         close(readfd);
>         close(writefd);
>         return 0;
> }
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/