Flume, mail # user - File Channel  performance and fsync


Re: File Channel performance and fsync
Jagadish Bihani 2012-10-22, 13:18
Hi

This is the simple configuration with which I am getting low
performance.
Even with a 2-tier architecture (cat source - avro sink - avro source -
HDFS sink),
I get similar performance with the file channel.

Configuration:
========
adServerAgent.sources = avro-collection-source
adServerAgent.channels = fileChannel
adServerAgent.sinks = hdfsSink fileSink

# For each one of the sources, the type is defined
adServerAgent.sources.avro-collection-source.type=exec
adServerAgent.sources.avro-collection-source.command = cat /home/hadoop/file.tsf

# The channel can be defined as follows.
adServerAgent.sources.avro-collection-source.channels = fileChannel

#Define file sink
adServerAgent.sinks.fileSink.type = file_roll
adServerAgent.sinks.fileSink.sink.directory = /home/hadoop/flume_sink
adServerAgent.sinks.fileSink.channel = fileChannel
adServerAgent.channels.fileChannel.type=file
adServerAgent.channels.fileChannel.dataDirs=/home/hadoop/flume/channel/dataDir5
adServerAgent.channels.fileChannel.checkpointDir=/home/hadoop/flume/channel/checkpointDir5
adServerAgent.channels.fileChannel.maxFileSize=4000000000

And it is run with:
JAVA_OPTS = -Xms500m -Xmx700m -Dcom.sun.management.jmxremote -XX:MaxDirectMemorySize=2g

Regards,
Jagadish

On 10/22/2012 05:42 PM, Brock Noland wrote:
> Hi,
>
> I'll respond in more depth later, but it would help if you posted your
> configuration file and the version of flume you are using.
>
> Brock
>
> On Mon, Oct 22, 2012 at 6:48 AM, Jagadish Bihani
> <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>>
> wrote:
>
>     Hi
>
>     I am writing this as a follow-up to another thread, which discussed
>     "fsync lies" and the fact that only the file channel, not the file
>     sink, uses fsync:
>
>     -- I tested fsync performance on 2 machines with almost the same
>     hardware configuration (on one machine I was getting very good
>     throughput using the file channel, and on the other it was almost
>     100 times slower), using the following code:
>
>
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <sys/time.h>
>     #include <unistd.h>
>
>     #define PAGESIZE 4096
>
>     int main(int argc, char *argv[])
>     {
>             char my_read_str[PAGESIZE];
>             char *read_filename = argv[1];
>             int readfd, writefd;
>
>             readfd = open(read_filename, O_RDONLY);
>             /* The mode must be octal (0777); decimal 777 sets odd bits. */
>             writefd = open("written_file", O_WRONLY|O_CREAT, 0777);
>             /* Find the input size, then rewind to the start. */
>             off_t len = lseek(readfd, 0, SEEK_END);
>             lseek(readfd, 0, SEEK_SET);
>             long iterations = len / PAGESIZE;
>             long i;
>             struct timeval t0, t1;
>
>             for (i = 0; i < iterations; i++)
>             {
>                     read(readfd, my_read_str, PAGESIZE);
>                     write(writefd, my_read_str, PAGESIZE);
>                     /* Time only the fsync call. */
>                     gettimeofday(&t0, 0);
>                     fsync(writefd);
>                     gettimeofday(&t1, 0);
>                     long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000 +
>                                    t1.tv_usec - t0.tv_usec;
>                     printf("Elapsed time is= %ld \n", elapsed);
>             }
>             close(readfd);
>             close(writefd);
>             return 0;
>     }
>
>
>     -- As expected, fsync typically takes 50000 microseconds to complete
>     on one machine, while on the other machine it takes about 200
>     microseconds (290 microseconds on average). So is the machine with
>     the higher performance doing an 'fsync lie'?
>
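To put the two latencies in context, here is a rough sketch of the arithmetic they imply. It assumes fsync calls are issued one after another, so the mean latency caps the number of synced writes per second. The names `max_syncs_per_second` and `mean_fsync_usec` are hypothetical helpers for illustration and are not part of the test program above or of Flume.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define PAGESIZE 4096

/* Upper bound on back-to-back fsync calls per second for a given mean
 * latency in microseconds. Returns 0 for a non-positive latency. */
double max_syncs_per_second(double mean_usec)
{
    return mean_usec > 0.0 ? 1000000.0 / mean_usec : 0.0;
}

/* Hypothetical helper: write `iterations` pages to `path`, fsync after
 * each one, and return the mean fsync latency in microseconds.
 * Returns -1.0 on any error or if `iterations` is not positive. */
double mean_fsync_usec(const char *path, int iterations)
{
    char page[PAGESIZE];
    struct timeval t0, t1;
    double total_usec = 0.0;
    int i;

    if (iterations <= 0)
        return -1.0;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1.0;

    memset(page, 'x', sizeof page);
    for (i = 0; i < iterations; i++) {
        if (write(fd, page, sizeof page) != (ssize_t)sizeof page) {
            close(fd);
            return -1.0;
        }
        /* Time only the fsync call, as in the original test. */
        gettimeofday(&t0, NULL);
        if (fsync(fd) != 0) {
            close(fd);
            return -1.0;
        }
        gettimeofday(&t1, NULL);
        total_usec += (t1.tv_sec - t0.tv_sec) * 1000000.0
                    + (t1.tv_usec - t0.tv_usec);
    }
    close(fd);
    return total_usec / iterations;
}
```

With the figures reported above, `max_syncs_per_second(50000.0)` gives 20 syncs per second, while `max_syncs_per_second(290.0)` gives roughly 3448. If each channel commit costs one fsync (an assumption about the workload, not something measured here), a gap of that size would be in line with the large throughput difference seen between the two machines.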
>     -- If I have understood it correctly, an "fsync lie" means the data
>     is not actually written to disk and is still in some disk/controller
>     buffer. I) If the disk loses power due to a shutdown or some other
>     disaster, the data will be lost. II) Can data be lost even without
>     that? (e.g. if data is kept in some disk buffer and fsync is being
>     invoked continuously, can that data also be lost?) If only part I is
>     true, then it can be acceptable, because the probability of a
>     shutdown is usually low in a production environment. But if even II
>     is true then there is a