Jay Kreps 2012-05-24, 17:40
Chris Burroughs 2012-06-19, 01:21
Re: Solution for blocking fsync in 0.8
S Ahmed 2012-05-25, 14:09
In practice (at LinkedIn), how long do you see calls blocked for during
fsyncs?

On Thu, May 24, 2012 at 1:40 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> One issue with using the filesystem for persistence is that the
> synchronization in the filesystem is not great. In particular the fsync and
> fdatasync system calls block appends to the file, apparently for the entire
> duration of the fsync (which can be quite long). This is documented in some
> detail here:
> http://antirez.com/post/fsync-different-thread-useless.html
>
> This is a problem in 0.7 because our definition of a committed message is
> one written prior to calling fsync(). This is the only way to guarantee the
> message is on disk. We do not hand out any messages to consumers until an
> fsync call occurs. The problem is that regardless of whether the fsync is
> in a background thread or not it will block any produce requests to the
> file. This is buffered a bit in the client since our produce request is
> effectively async in 0.7, but it can lead to weird latency spikes
> nonetheless as this buffering gets filled.
>
> In 0.8 with replication the definition of a committed message changes to
> one that is replicated to multiple machines, not necessarily committed to
> disk. This is a different kind of guarantee with different strengths and
> weaknesses (pro: data can survive destruction of the file system on one
> machine, con: you will lose a few messages if you haven't sync'd and the
> power goes out). We will likely retain the flush interval and time settings
> for those who want fine grained control over flushing, but it is less
> relevant.
>
> Unfortunately *any* call to fsync will block appends even in a background
> thread so how can we give control over physical disk persistence without
> introducing high latency for the producer? The answer is that the Linux
> pdflush daemon actually does a very similar thing to our flush parameters.
> pdflush is a daemon running on every Linux machine that controls the
> writing of buffered/cached data back to disk. It allows you to control the
> percentage of memory filled with dirty pages by giving it either a
> percentage of memory, a time out for any dirty page to be written, or a
> fixed number of dirty bytes.
>
> The question is, does pdflush block appends? The answer seems to be mostly
> no. It locks the page being flushed but not the whole file. The time to
> flush one page is actually usually pretty quick (plus I think it may not be
> flushing just written pages anyway). I wrote some test code for this and
> here are the results:
>
> I modified the code from the link above. Here are the results from my
> desktop (CentOS Linux 2.6.32).
>
> We run the test writing 1024 bytes every 100 us and flushing every 500 us:
>
> $ ./pdflush-test 1024 100 500
> 21
> 4
> 3
> 3
> 9
> 6
> Sync in 20277 us (0), sleeping for 500 us
> 19819
> 7
> 7
> 8
> 38
> Sync in 19470 us (0), sleeping for 500 us
> 19048
> 7
> 4
> 3
> 8
> 4
> Sync in 19405 us (0), sleeping for 500 us
> 19017
> 6
> 6
> 10
> 6
> Sync in 19410 us (0), sleeping for 500 us
> 19025
> 7
> 7
> 11
> 6
>
> $ cat /proc/sys/vm/dirty_writeback_centisecs
> 100
> $ cat /proc/sys/vm/dirty_expire_centisecs
> 500
>
> Now run the test with the background flush disabled (rarely running):
> $ ./pdflush-test 1024 100 5000000000000 > times.txt
>
> I ran this for 298,028 writes. The 99.9th percentile for this test is 17 us
> and the max time was 2043 us (2 ms).
>
> Here is the test code:
>
> #include <stdio.h>
> #include <unistd.h>
> #include <string.h>
> #include <sys/types.h>
> #include <pthread.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <sys/time.h>
> #include <stdlib.h>
>
> static long long microseconds(void) {
>     struct timeval tv;
>     long long mst;
>
>     gettimeofday(&tv, NULL);
>     mst = ((long long)tv.tv_sec)*1000000;
>     mst += tv.tv_usec;
>     return mst;
> }
>
> void *IOThreadEntryPoint(void *arg) {
>     int fd, retval;
>     long long start;
>     long sleep = (long) arg;
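
The quoted test code is cut off in this archive, so the rest of it is not reproduced here. For readers who want to repeat the measurement, the following is a minimal, independent sketch of the same idea: time each append when no fsync is issued, then take a percentile of the recorded latencies. It is not the author's program. The names `now_us`, `time_appends`, and `percentile_us` are hypothetical, it uses `clock_gettime` where the quoted code uses `gettimeofday`, and the nearest-rank percentile is an assumption about how the 99.9th percentile might be computed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Monotonic time in microseconds (hypothetical helper; the quoted code
 * uses gettimeofday() for the same purpose). */
long long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Append `count` buffers of `nbytes` bytes to `path`, sleeping `interval_us`
 * between writes, and store each write() latency in microseconds in out[i].
 * No fsync is issued, so writeback is left entirely to the kernel.
 * Returns 0 on success, -1 on error. */
int time_appends(const char *path, size_t nbytes, long interval_us,
                 size_t count, long long *out) {
    char *buf = malloc(nbytes);
    if (buf == NULL) return -1;
    memset(buf, 'x', nbytes);

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd == -1) { free(buf); return -1; }

    struct timespec pause = { interval_us / 1000000L,
                              (interval_us % 1000000L) * 1000L };
    int rc = 0;
    for (size_t i = 0; i < count; i++) {
        long long start = now_us();
        if (write(fd, buf, nbytes) != (ssize_t)nbytes) { rc = -1; break; }
        out[i] = now_us() - start;
        nanosleep(&pause, NULL);
    }
    close(fd);
    free(buf);
    return rc;
}

static int cmp_ll(const void *a, const void *b) {
    long long x = *(const long long *)a, y = *(const long long *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile of n samples, 0 < pct <= 100. Sorts v in place. */
long long percentile_us(long long *v, size_t n, double pct) {
    assert(n > 0 && pct > 0.0 && pct <= 100.0);
    qsort(v, n, sizeof *v, cmp_ll);
    double exact = pct / 100.0 * (double)n;
    size_t rank = (size_t)exact;
    if ((double)rank < exact) rank++;   /* round up to the next rank */
    if (rank < 1) rank = 1;
    if (rank > n) rank = n;
    return v[rank - 1];
}
```

A caller would allocate an array of `long long`, run something like `time_appends("test.dat", 1024, 100, n, lat)`, and then read off `percentile_us(lat, n, 99.9)` and the maximum. Adding a second thread that calls fsync on the same file would be the way to compare against the blocked case shown in the quoted output.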
Jay Kreps 2012-05-25, 17:22
S Ahmed 2012-05-26, 09:50