Re: File Channel performance and fsync
Hi Jagadish,
   I tested the performance of FileChannel recently. I can share the test
report with you to inform the discussion and questions in this thread.
    On the comparison between FileChannel and File Sink: FileChannel does
both sequential writes and random reads, so the disk head has to seek many
times, which makes it much slower than purely sequential writing.
    The 'fsync' call takes much more time than the write itself; I get about
100 fsync calls/sec, the same number Brock mentioned. Also, I don't know why
there is such a difference between your two servers. I think it might be
related to the OS version (use of fsync versus fdatasync) or the
disk driver (RAID, caching strategy, and so on).
    Throughput of a single FileChannel is about 3-5MB/sec in my environment,
so I used 5 channels and got 18MB/sec. It is hard to believe that throughput
would keep increasing linearly with more channels. Meanwhile, this looks like
the throughput limit imposed by the 'fsync' operation. I tested another case
without the 'fsync' operation after each batch and got about 35-40MB/sec
(I also removed the pre-allocation on disk writes in this case).
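The effect of syncing once per batch, or not at all, can be sketched as follows. This is an illustration under assumptions rather than the FileChannel implementation: write_with_batched_fsync and its parameters are hypothetical names, and a batch value of 0 stands in for the "no fsync" case.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define PAGESIZE 4096

/* Hypothetical helper: writes `pages` pages to `path`, calling fsync()
 * after every `batch` pages (batch <= 0 means never). Stores the total
 * elapsed time in microseconds in *elapsed_us. Returns the number of
 * fsync() calls made, or -1 on error. */
long write_with_batched_fsync(const char *path, int pages, int batch,
                              long *elapsed_us)
{
    char buf[PAGESIZE];
    struct timeval t0, t1;
    long syncs = 0;
    int i;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    memset(buf, 'x', sizeof buf);

    gettimeofday(&t0, NULL);
    for (i = 1; i <= pages; i++) {
        if (write(fd, buf, PAGESIZE) != PAGESIZE) {
            close(fd);
            return -1;
        }
        /* Flush only at batch boundaries. */
        if (batch > 0 && i % batch == 0) {
            if (fsync(fd) != 0) {
                close(fd);
                return -1;
            }
            syncs++;
        }
    }
    gettimeofday(&t1, NULL);
    close(fd);

    *elapsed_us = (t1.tv_sec - t0.tv_sec) * 1000000L +
                  (t1.tv_usec - t0.tv_usec);
    return syncs;
}
```

Running it with the same page count and batch values of 1, 100, and 0 gives a rough picture of how much of the write time goes to fsync on a given disk.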
    Hope this is useful to you.

   PS: I heard that the OS has a daemon thread that flushes the page cache
to disk asynchronously, with a latency of seconds. Is that effective enough
for data where some loss is tolerable?
-Regards
Denny Ye

2012/10/22 Jagadish Bihani <[EMAIL PROTECTED]>

>  Hi
>
> I am following up on another thread that discussed "fsync lies" and noted
> that only the file channel uses fsync, not the file sink:
>
> -- I tested fsync performance on 2 machines (on one machine I was
> getting very good throughput using the file channel, and on the other it
> was almost 100 times slower, with almost the same hardware configuration)
> using the following code:
>
>
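> #include <fcntl.h>
> #include <stdio.h>
> #include <sys/time.h>
> #include <unistd.h>
>
> #define PAGESIZE 4096
>
> int main(int argc, char *argv[])
> {
>         char my_read_str[PAGESIZE];
>         int readfd, writefd;
>         struct timeval t0, t1;
>         long i;
>
>         if (argc < 2) {
>                 fprintf(stderr, "usage: %s <input_file>\n", argv[0]);
>                 return 1;
>         }
>         readfd = open(argv[1], O_RDONLY);
>         /* The mode must be octal: 0777, not decimal 777. */
>         writefd = open("written_file", O_WRONLY | O_CREAT, 0777);
>         if (readfd < 0 || writefd < 0) {
>                 perror("open");
>                 return 1;
>         }
>
>         /* Find the input size, then rewind to the start. */
>         off_t len = lseek(readfd, 0, SEEK_END);
>         lseek(readfd, 0, SEEK_SET);
>         long iterations = len / PAGESIZE;
>
>         for (i = 0; i < iterations; i++) {
>                 /* Copy one page, then time only the fsync() call. */
>                 if (read(readfd, my_read_str, PAGESIZE) != PAGESIZE ||
>                     write(writefd, my_read_str, PAGESIZE) != PAGESIZE) {
>                         perror("read/write");
>                         return 1;
>                 }
>                 gettimeofday(&t0, 0);
>                 fsync(writefd);
>                 gettimeofday(&t1, 0);
>                 long elapsed = (t1.tv_sec - t0.tv_sec) * 1000000L +
>                                (t1.tv_usec - t0.tv_usec);
>                 printf("Elapsed time is= %ld \n", elapsed);
>         }
>         close(readfd);
>         close(writefd);
>         return 0;
> }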
>
>
> -- As expected, fsync typically requires 50000 microseconds to
> complete on one machine, while on the other machine it typically takes
> 200 microseconds (290 microseconds on average). So
> is the machine with higher
> performance doing an 'fsync lie'?
>
> -- If I have understood it correctly, an "fsync lie" means the data is not
> actually written to disk and is still in
> some disk/controller buffer. I) If the disk loses power due to a
> shutdown or some other disaster, data will
> be lost. II) Can data be lost even without that? (E.g. if the disk keeps
> data in some buffer and fsync is being
> invoked continuously, can that data also be lost?) If only part
> I is true, then it can be acceptable,
> because the probability of a shutdown is usually low in a production
> environment. But if II is also true, then there is a
> problem.
>
> -- But on the machine where the disk doesn't lie, the performance of Flume
> using the file channel is very low (the maximum I have seen is
> 100 KB/sec, even with sufficient DirectMemory allocation). Does
> anybody have stats about the throughput
> of the file channel? Is anybody getting better performance with the file
> channel (without fsync lies)? What is the recommended
> usage of it for an average scenario ? (Transferring files of few MBs to