Flume >> mail # user >> File Channel  performance and fsync


Jagadish Bihani 2012-10-22, 11:48
Denny Ye 2012-10-22, 13:38
Jagadish Bihani 2012-10-23, 06:31
Re: File Channel performance and fsync
Without the fsync, guarantees are weakened a lot more than in the
fsync-lying case.

Also, you didn't mention the batch size on your avro sink that is
sending data to the avro-source. This is a major factor in your
throughput because each batch causes one fsync. If you have big batches,
you'll have few fsyncs and significantly better performance.

I am weirded out by the fact that Denny is getting improved performance
by running multiple parallel file channels... Are they each on separate
disks or something? I can't imagine what could cause a performance gain
if they were all on the same disk. I would expect more write head
skipping around, and even degradation...

On 10/23/2012 03:31 PM, Jagadish Bihani wrote:
> Hi Denny
>
> Thanks for the inputs.
> Btw, when you say you tested another case without 'fsync', I think
> you changed the file channel code to comment out the 'flush' part of it.
> And if we rely on OS flushing, it can still be reasonably reliable.
> Is that right?
>
> Regards,
> Jagadish
>
> On 10/22/2012 07:08 PM, Denny Ye wrote:
>> Hi Jagadish,
>>    I tested the performance of FileChannel recently, so here is my
>> test report for you to think over and ask questions about in
>> this thread.
>>     On the comparison between FileChannel and File Sink:
>> FileChannel does both sequential writes and random reads, so the
>> disk head has to move around a lot, which is much slower than
>> purely sequential writing.
>>     The 'fsync' call takes much more time than the write itself; I get
>> about 100 fsyncs/sec, the same number Brock mentioned. Also, I don't
>> know why there is such a difference between your two servers. I think
>> it might be related to the OS version (whether fsync or fdatasync
>> is used) or the disk setup (RAID, caching strategy, and so
>> on).
>>     Throughput of a single FileChannel is about 3-5MB/sec in my
>> environment, so I used 5 channels and got 18MB/sec. The linear
>> increase with more channels is hard to believe. Meanwhile, this looks
>> like the throughput limit imposed by the 'fsync' operation. I tested
>> another case without the 'fsync' operation after each batch and got
>> about 35-40MB/sec (I also removed the pre-allocation at disk writing
>> in this case).
>>     Hope this is useful for you.
>>
>>    PS: I heard that the OS has a daemon thread that flushes the page
>> cache to disk asynchronously, with a latency of seconds. Is that
>> good enough for data where some loss can be tolerated?
>>
>> -Regards
>> Denny Ye
>>
>> 2012/10/22 Jagadish Bihani <[EMAIL PROTECTED]>
>>
>>     Hi
>>
>>     I am writing this as a follow-up to another thread where there was
>>     a discussion on "fsync lies" and on the fact that
>>     only file channel uses fsync and not file sink:
>>
>>     -- I tested the fsync performance on 2 machines (on 1 machine I
>>     was getting very good throughput
>>     using file channel, and on the other it was almost 100 times slower
>>     with almost the same hardware configuration)
>>     using the following code:
>>
>>
>>     #include <fcntl.h>
>>     #include <sys/time.h>
>>     #include <unistd.h>
>>
>>     #define PAGESIZE 4096
>>
>>     int main(int argc, char *argv[])
>>     {
>>
>>             char my_write_str[PAGESIZE];
>>             char my_read_str[PAGESIZE];
>>             char *read_filename= argv[1];
>>             int readfd,writefd;
>>
>>             readfd = open(read_filename,O_RDONLY);
>>             writefd = open("written_file",O_WRONLY|O_CREAT,0777);
>>             int len=lseek(readfd,0,SEEK_END);
>>             lseek(readfd,0,SEEK_SET);
>>             int iterations = len/PAGESIZE;
>>             int i;
>>             struct timeval t0,t1;
>>
>>            for(i=0;i<iterations;i++)
>>             {
>>
>>                     read(readfd,my_read_str,PAGESIZE);
>>                     write(writefd,my_read_str,PAGESIZE);
>>                     gettimeofday(&t0,0);
>>                     fsync(writefd);
>>                     gettimeofday(&t1,0);
>>                     long elapsed = (t1.tv_sec-t0.tv_sec)*1000000 +
>>     t1.tv_usec-t0.tv_usec;
Jagadish Bihani 2012-10-22, 13:18
Brock Noland 2012-10-22, 13:59
Brock Noland 2012-10-22, 14:29
Jagadish Bihani 2012-10-23, 06:40
Juhani Connolly 2012-10-23, 07:26