Flume >> mail # user >> File Channel  performance and fsync


Re: File Channel performance and fsync
Hi

This is the simple configuration with which I am getting lower
performance.
Even with a 2-tier architecture (cat source - avro sink - avro source -
HDFS sink),
I get similar performance with the file channel.

Configuration:
========
adServerAgent.sources = avro-collection-source
adServerAgent.channels = fileChannel
adServerAgent.sinks = hdfsSink fileSink

# For each one of the sources, the type is defined
adServerAgent.sources.avro-collection-source.type=exec
adServerAgent.sources.avro-collection-source.command = cat /home/hadoop/file.tsf

# The channel can be defined as follows.
adServerAgent.sources.avro-collection-source.channels = fileChannel

#Define file sink
adServerAgent.sinks.fileSink.type = file_roll
adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
adServerAgent.sinks.fileSink.channel = fileChannel
adServerAgent.channels.fileChannel.type=file
adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
adServerAgent.channels.fileChannel.maxFileSize=4000000000

And it is run with :
JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote -XX:MaxDirectMemorySize=2g

Regards,
Jagadish

On 10/22/2012 05:42 PM, Brock Noland wrote:
> Hi,
>
> I'll respond in more depth later, but it would help if you posted your
> configuration file and the version of flume you are using.
>
> Brock
>
> On Mon, Oct 22, 2012 at 6:48 AM, Jagadish Bihani
> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>>
> wrote:
>
>     Hi
>
>     I am following up on another thread, which discussed "fsync lies"
>     and noted that only the file channel uses fsync, not the file sink:
>
>     -- I tested the fsync performance on 2 machines (on one machine I
>     was getting very good throughput using the file channel, and on the
>     other it was almost 100 times slower with almost the same hardware
>     configuration) using the following code:
>
>
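>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/time.h>
>     #include <unistd.h>
>
>     #define PAGESIZE 4096
>
>     int main(int argc, char *argv[])
>     {
>             char buf[PAGESIZE];
>             int readfd, writefd;
>             struct timeval t0, t1;
>
>             if (argc < 2) {
>                     fprintf(stderr, "usage: %s <input-file>\n", argv[0]);
>                     return 1;
>             }
>
>             readfd = open(argv[1], O_RDONLY);
>             /* The mode must be octal: 0644, not decimal 777. */
>             writefd = open("written_file", O_WRONLY | O_CREAT, 0644);
>             if (readfd < 0 || writefd < 0) {
>                     perror("open");
>                     return 1;
>             }
>
>             /* Input size in whole pages. */
>             off_t len = lseek(readfd, 0, SEEK_END);
>             lseek(readfd, 0, SEEK_SET);
>             long iterations = (long)(len / PAGESIZE);
>
>             for (long i = 0; i < iterations; i++) {
>                     /* Copy one page, then time only the fsync(). */
>                     if (read(readfd, buf, PAGESIZE) != PAGESIZE ||
>                         write(writefd, buf, PAGESIZE) != PAGESIZE) {
>                             perror("read/write");
>                             return 1;
>                     }
>                     gettimeofday(&t0, 0);
>                     fsync(writefd);
>                     gettimeofday(&t1, 0);
>                     long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000 +
>                                    t1.tv_usec - t0.tv_usec;
>                     printf("Elapsed time is= %ld \n", elapsed);
>             }
>             close(readfd);
>             close(writefd);
>             return 0;
>     }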
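
The timing arithmetic above can also be done with a monotonic clock, which is not affected by wall-clock adjustments during the run. The following is only a sketch under that assumption; the helper names elapsed_us and timed_fsync_us are hypothetical and are not part of the original test program.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <time.h>
#include <unistd.h>

/* Microseconds from start to end; both values must come from the same clock. */
long elapsed_us(struct timespec start, struct timespec end)
{
    return (long)(end.tv_sec - start.tv_sec) * 1000000L
         + (end.tv_nsec - start.tv_nsec) / 1000L;
}

/* Time a single fsync() on an open descriptor.
 * Returns the elapsed microseconds, or -1 if a call fails. */
long timed_fsync_us(int fd)
{
    struct timespec t0, t1;

    assert(fd >= 0);
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1;
    if (fsync(fd) != 0)
        return -1;
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1;
    return elapsed_us(t0, t1);
}
```

In the loop of the test program, a call such as timed_fsync_us(writefd) would stand in for the two gettimeofday() calls around fsync().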
>
>
>     -- As expected, fsync typically takes about 50000 microseconds to
>     complete on one machine, while on the other machine it typically
>     takes about 200 microseconds (290 microseconds on average). So is
>     the machine with higher performance doing an 'fsync lie'?
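
One rough way to reason about the question above is to compare the measured latency with the rotational latency of a spinning disk. The sketch below assumes a 7200 RPM drive, which the thread does not state, and uses half a revolution as a crude floor for a write that really reaches the platter. The function names are hypothetical. A result below the floor only suggests that some cache is absorbing the write; an SSD or a battery-backed controller cache can be that fast without lying.

```c
#include <assert.h>
#include <stddef.h>

/* Time for one full platter revolution, in microseconds. */
double rotation_us(double rpm)
{
    assert(rpm > 0.0);
    return 60.0 * 1e6 / rpm;
}

/* Mean of n fsync latency samples, in microseconds. */
double mean_us(const long *samples, size_t n)
{
    double sum = 0.0;

    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += (double)samples[i];
    return sum / (double)n;
}

/* Heuristic only: returns 1 if the mean latency is below half a
 * revolution, i.e. faster than the assumed average rotational latency. */
int faster_than_platter(double mean_latency_us, double rpm)
{
    return mean_latency_us < rotation_us(rpm) / 2.0;
}
```

With the assumed 7200 RPM, one revolution takes about 8333 microseconds, so the 290 microsecond average falls below the floor while the 50000 microsecond figure does not.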
>
>     -- If I have understood it correctly, an "fsync lie" means the data
>     is not actually written to disk but sits in some disk/controller
>     buffer. I) If the disk then loses power due to a shutdown or some
>     other disaster, the data will be lost. II) Can data be lost even
>     without that? (E.g. if the disk keeps data in some buffer and fsync
>     is invoked continuously, can that data also be lost?) If only part
>     I is true, then it may be acceptable, because the probability of a
>     shutdown is usually low in a production environment. But if even II
>     is true, then there is a