Raj N 2012-06-14, 14:56
Are you guys aware of any issues with forceSync=no that could cause the transaction log to get corrupted on a ZooKeeper crash?
Thanks -Raj
Jonathan Simms 2012-06-14, 17:03
There's a big warning in the documentation that says that's a possibility. If you don't force both Java and the OS to flush their IO buffers to disk, then you have no guarantees that your data is consistent.
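As a rough illustration of what "forcing" a write means at the Java level, here is a minimal sketch. It is not ZooKeeper's code: the class `DurableAppend` and its methods are made up for this example. The only ZooKeeper-specific detail is the documented `zookeeper.forceSync` system property, which the sketch reads the way the documentation describes it (anything other than "no" means sync).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper, not ZooKeeper code: appends one record and optionally
// forces it to the storage device before returning.
final class DurableAppend {

    // Interprets the zookeeper.forceSync system property as documented:
    // "no" disables syncing, anything else (or unset) keeps it enabled.
    static boolean forceSyncEnabled() {
        return !"no".equals(System.getProperty("zookeeper.forceSync", "yes"));
    }

    static void append(Path log, byte[] record, boolean forceSync) throws IOException {
        try (FileChannel ch = FileChannel.open(log,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.APPEND)) {
            ByteBuffer buf = ByteBuffer.wrap(record);
            while (buf.hasRemaining()) {
                ch.write(buf); // after this, data may still sit only in the OS page cache
            }
            if (forceSync) {
                ch.force(true); // ask the OS to push data and metadata to the device
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("txnlog", ".bin");
        append(log, "txn-1\n".getBytes(StandardCharsets.UTF_8), forceSyncEnabled());
        System.out.println("wrote " + Files.size(log) + " bytes, forceSync=" + forceSyncEnabled());
        Files.delete(log);
    }
}
```

Without the `force` call, a crash of the machine (as opposed to just the JVM) can lose or reorder writes that the application believed were complete.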
Raj N 2012-06-14, 18:56
Sorry, I should have been more specific. By corrupt, I mean that the ZooKeeper node doesn't come back up on a restart. I would have imagined that ZooKeeper would sync the lost transactions from its peers. I agree I will have a problem if I have multiple failures. But for a single node failure in a 3-node ensemble, I should be able to recover even if forceSync=no.
Thanks -Raj
Patrick Hunt 2012-06-15, 18:17
For a single node failure I believe so.
Patrick
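The "single node failure in a 3-node ensemble" reasoning rests on majority-quorum arithmetic, which can be sketched as follows. `QuorumMath` is a hypothetical name for this example only. Note that it only counts how many servers may be down; with forceSync=no, whether the surviving servers actually hold the latest transactions on disk is a separate question.

```java
// Hypothetical helper illustrating majority-quorum arithmetic.
final class QuorumMath {

    // Smallest number of servers that forms a strict majority of the ensemble.
    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can be down while a majority is still reachable.
    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[] {3, 5}) {
            System.out.println(n + " servers: quorum=" + quorumSize(n)
                    + ", tolerated failures=" + toleratedFailures(n));
        }
    }
}
```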
Raj N 2012-06-15, 19:45
Thanks Patrick. So back to my original question: will forceSync=no cause a ZooKeeper node to not start up cleanly if it crashed? It doesn't happen all the time, but it happened to me once. I had to clean up all the transaction logs on all the nodes and then bring the nodes back up. For my use case I can tolerate losing data in the worst-case scenario, since I can regenerate all the data. But I should be able to tolerate a single point of failure as long as the node comes back up cleanly. Can ZooKeeper recover from a corrupt transaction log using existing snapshots and then replaying messages from its peers?
Thanks -Raj
Patrick Hunt 2012-06-16, 00:33
On Fri, Jun 15, 2012 at 12:45 PM, Raj N <[EMAIL PROTECTED]> wrote:
> Can zookeeper recover from a corrupt transaction log using existing snapshots and then replaying messages from its peers?
A server will try to recover as best it can (using the snaps/logs it has available), and then talk to the other servers in the quorum to see if anyone else has a more recent committed change. If the recovering server turns out to be behind, it will download what's necessary to get in sync with the new leader.
What might have happened in your case is that you hit a bug, perhaps a type of corruption that we don't handle successfully; see, for example, ZOOKEEPER-1449.
Patrick
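The two recovery steps described above (replay what is available locally, then catch up from the leader) can be sketched roughly as below. This is a toy model under stated assumptions, not the server's actual logic: `RecoverySketch`, `replayLocal` and `missingFromLeader` are hypothetical names, and transactions are reduced to bare zxid numbers.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model of "recover locally, then sync with the leader". Hypothetical code.
final class RecoverySketch {

    // Step 1: start from the snapshot and apply local log entries newer than it.
    // Returns the highest zxid this server holds after local recovery.
    static long replayLocal(long snapshotZxid, List<Long> localLogZxids) {
        long last = snapshotZxid;
        for (long zxid : localLogZxids) {
            if (zxid > last) {
                last = zxid;
            }
        }
        return last;
    }

    // Step 2: work out which committed transactions must be fetched from the leader.
    static List<Long> missingFromLeader(long localLast, List<Long> leaderCommitted) {
        List<Long> missing = new ArrayList<>();
        for (long zxid : leaderCommitted) {
            if (zxid > localLast) {
                missing.add(zxid);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        long localLast = replayLocal(100L, Arrays.asList(101L, 102L, 103L));
        List<Long> toFetch = missingFromLeader(localLast,
                Arrays.asList(101L, 102L, 103L, 104L, 105L, 106L));
        System.out.println("local last zxid=" + localLast + ", fetch from leader=" + toFetch);
    }
}
```

The sketch assumes the local log is readable. The failure Raj describes is the case where step 1 itself fails, so the server never reaches step 2.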
Flavio Junqueira 2012-06-16, 11:25
There are some corner cases that could lead you to lose data depending on your settings, even if forceSync is enabled. For example, if your disk write cache is enabled, then there are some sequences of events that could lead you to lose updates. With the disk write cache enabled, updates forced to disk could be lost locally, and depending on how many copies exist across servers, they may not be recoverable.
Options I'm aware of to get around this are to use write barriers, battery-backed RAID controllers, or another solution that uses some form of non-volatile memory. I must also say that I'm not aware of any such case happening in production use. We observed it in lab experiments, though.
-Flavio
Raj N 2012-06-16, 16:05
Mahadev, to answer your question: yes, we get significantly better performance with forceSync=no. In fact, Patrick is probably right. You probably ran the tests when the bug existed. It was one of my team members who raised the "forceSync=no not working" bug.
A couple more facts. We use the ext4 filesystem (default options) on RHEL, kernel 2.6.18-238.el5 (note it's not el6, so ext4 is backported; ext3 is the default on el5). We use 500 GB SAS drives with a battery-backed write cache (BBWC, 1 GB; disk write cache, DWC, disabled). But somehow my performance with forceSync=yes is still not the best. I have been thinking it might be because the default ext4 options enable the write barrier. The barrier essentially makes the BBWC useless. I think I can safely disable the barrier since I have BBWC. I haven't tried this out yet. But what do you guys think?
Thanks -Raj
Flavio Junqueira 2012-06-18, 14:41
I haven't closely followed the development of the Linux kernel, but my understanding from reading blog posts here and there is that the implementation of the write barrier is not fully reliable, so you may still lose data. At the same time, it adds significant performance overhead. Consequently, I would think that you're better off relying on the BBWC and turning off the write barrier.
-Flavio
Mahadev Konar 2012-06-15, 22:27
Raj, are you seeing any perf improvements by using that setting? I had run some benchmarks and didn't find much improvement even with forceSync=no.
thanks mahadev
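For anyone wanting to check the raw cost of syncing on their own disks, a crude micro-benchmark along these lines may help. It is only a sketch: `SyncBenchmark` and `timeAppendsNanos` are hypothetical names, it measures plain file appends rather than ZooKeeper itself, and the numbers depend heavily on the filesystem, mount options and write cache discussed in this thread.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical micro-benchmark: times sequential appends with and without a
// force() after every record. Not a ZooKeeper benchmark.
final class SyncBenchmark {

    static long timeAppendsNanos(Path file, int records, int recordSize, boolean forceEach)
            throws IOException {
        byte[] payload = new byte[recordSize];
        long start = System.nanoTime();
        try (FileChannel ch = FileChannel.open(file,
                StandardOpenOption.CREATE,
                StandardOpenOption.WRITE,
                StandardOpenOption.TRUNCATE_EXISTING)) {
            for (int i = 0; i < records; i++) {
                ByteBuffer buf = ByteBuffer.wrap(payload);
                while (buf.hasRemaining()) {
                    ch.write(buf);
                }
                if (forceEach) {
                    ch.force(false); // flush file data (not necessarily metadata) to the device
                }
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("syncbench", ".bin");
        long synced = timeAppendsNanos(file, 200, 512, true);
        long unsynced = timeAppendsNanos(file, 200, 512, false);
        System.out.println("force after each record: " + synced / 1_000_000 + " ms");
        System.out.println("no force:                " + unsynced / 1_000_000 + " ms");
        Files.delete(file);
    }
}
```

If the two timings come out nearly identical, that may indicate the sync is being absorbed by a cache somewhere, or (as Patrick hints) that the setting is not taking effect.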
Patrick Hunt 2012-06-16, 00:27
Was that back when forceSync was broken? ;-) (seriously)
Patrick
Vitalii Tymchyshyn 2012-06-18, 12:55
Hello.
Please note that in general not doing fsync may lead to inconsistent data, where later data was written and earlier data was not. This may not be a problem if ZooKeeper validates all its data on start, but in the worst scenario I can imagine, it will read a "last transaction counter" updated to the latest value but find data that is not consistent with the counter. So the problem may be not handling the error, but detecting it.
Best regards, Vitalii Tymchyshyn
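One common way to make this kind of damage detectable is to store a checksum with every log record and stop reading at the first record that fails validation. The sketch below shows the idea with `java.util.zip.Adler32`. The frame layout and the `ChecksummedLog` class are invented for this example and are not ZooKeeper's on-disk format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.Adler32;

// Hypothetical record framing: 8-byte checksum, 4-byte length, then the payload.
final class ChecksummedLog {

    static byte[] frame(byte[] payload) {
        Adler32 sum = new Adler32();
        sum.update(payload, 0, payload.length);
        return ByteBuffer.allocate(12 + payload.length)
                .putLong(sum.getValue())
                .putInt(payload.length)
                .put(payload)
                .array();
    }

    // Returns the records that precede the first torn or corrupted frame.
    static List<byte[]> readValid(byte[] log) {
        List<byte[]> records = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(log);
        while (buf.remaining() >= 12) {
            long expected = buf.getLong();
            int len = buf.getInt();
            if (len < 0 || len > buf.remaining()) {
                break; // length field is garbage or the payload was cut short
            }
            byte[] payload = new byte[len];
            buf.get(payload);
            Adler32 sum = new Adler32();
            sum.update(payload, 0, len);
            if (sum.getValue() != expected) {
                break; // payload does not match its checksum
            }
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (String s : new String[] {"create /a", "set /a 1", "delete /a"}) {
            byte[] f = frame(s.getBytes(StandardCharsets.UTF_8));
            out.write(f, 0, f.length);
        }
        byte[] log = out.toByteArray();
        System.out.println("intact log: " + readValid(log).size() + " records");
        log[log.length - 1] ^= 0x01; // simulate a damaged tail
        System.out.println("damaged tail: " + readValid(log).size() + " records");
    }
}
```

A checksum catches a torn or partially written record, but it cannot tell that a whole record is missing, which is why the ordering problem described above still matters.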