|
|
=?KOI8-U?B?96bUwcymyiD0yc... 2012-04-15, 17:13
I really would not like for any library to perform a System.exit call. This would make huge program exit out of sudden (think about j2ee, you may be bitten by security manager). Note that there are more or less safe errors, like StackOverflowError. Also System.exit make testing nightmare. E.g. maven2 silently skips any tests after the one that calls System.exit. And everything's green. As for me good options are: 1) Call user-provided uncaught exception handler. Use the one from the thread that created the connection if one is not specified explicity. 1) Stop everything, notifying user with a global watcher. If it's possible, clean any static state (e.g. restart threads) and allow to restart connection. In any case, call user code. Good system already know how to react (it may want to send email to admin), allow it to perform well. Best regards, Vitalii Tymchyshyn. 2012/4/13 Camille Fournier <[EMAIL PROTECTED]> > Hi everyone, > > I'm trying to evaluate a patch that Jeremy Stribling has submitted, and I'd > like some feedback from the user base on it. > https://issues.apache.org/jira/browse/ZOOKEEPER-1442> > The current behavior of ZK when we get an uncaught exception is to log it > and try to move on. This is arguably not the right thing to do, and will > possibly cause ZK to limp along with a bad VM (say, in an OOM state) for > longer than it should. > The patch proposes that when we get an instance of java.lang.Error, we > should do a system.exit to fast-fail the process. With the possible > exception of ThreadDeath (which may or may not be an unrecoverable system > state depending on the thread), I think this makes sense, but I would like > to hear from others if they have an opinion. I think it's better to kill > the process and let your monitoring services detect process death (and thus > restart) than possibly linger unresponsive for a while, are there scenarios > that we're missing where this error can occur and you wouldn't want the > process killed? > > Thanks for your feedback, > > Camille > -- Best regards, Vitalii Tymchyshyn
+
=?KOI8-U?B?96bUwcymyiD0yc... 2012-04-15, 17:13
Scott Fines 2012-04-16, 17:01
For testing, including Michi's suggestion to provide a mechanism to disable this behavior should be sufficient. Scott On Mon, Apr 16, 2012 at 11:52 AM, Ishaaq Chandy <[EMAIL PROTECTED]> wrote: > Fair enough, if you think it is an anti-pattern then perhaps you can > suggest an alternative that allows one to write tests that are (in > descending order of priority): > > 1. Quick and easy to setup and teardown > 2. Repeatably testable and debuggable in an IDE without having to resort to > the external build tool > 3. Testable in parallel, i.e. having more than one build running on a CI > server, so need some way to avoid port clashes > 4. Cross-platform - i.e. run the exact same build sequence on multiple > OSes. > > Ishaaq > > On 16 April 2012 22:55, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > I believe that this change is inspired by someone that runs zk embedded. > > Personally I'm not moved by the testing argument, embedding the server > for > > testing is a bit of an anti pattern in my mind. > > > > From my phone > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <[EMAIL PROTECTED]> wrote: > > > > > I'd go so far as to say that even the server-code should avoid > > System.exit. > > > Just because it is "meant" to be a standalone system doesn't mean that > > code > > > that makes it impossible to embed it should be encouraged. > > > > > > For e.g, we embed a local version of ZK to be used inside our unit > tests. > > > This makes it much easier for us to control ZK to coincide with test > > > expectations as well as making for much faster build times. It would > be a > > > shame if the embedded ZK started killing the JVM. > > > > > > Ishaaq > > > > > > On 16 April 2012 04:28, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > > > This is a good point. > > > > I think this change should be fine for the server portion of the > code, > > > > since it's designed to be run as a standalone system. But for the > > > > client connection to also call system.exit on such an error is > > > > overreaching for all the reasons listed below. > > > > > > > > C > > > > > > > > 2012/4/15 Віталій Тимчишин <[EMAIL PROTECTED]>: > > > > > I really would not like for any library to perform a System.exit > > call. > > > > This > > > > > would make huge program exit out of sudden (think about j2ee, you > may > > > be > > > > > bitten by security manager). Note that there are more or less safe > > > > errors, > > > > > like StackOverflowError. > > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips > > any > > > > > tests after the one that calls System.exit. And everything's green. > > > > > As for me good options are: > > > > > 1) Call user-provided uncaught exception handler. Use the one from > > the > > > > > thread that created the connection if one is not specified > explicity. > > > > > 1) Stop everything, notifying user with a global watcher. If it's > > > > possible, > > > > > clean any static state (e.g. restart threads) and allow to restart > > > > > connection. > > > > > In any case, call user code. Good system already know how to react > > (it > > > > may > > > > > want to send email to admin), allow it to perform well. > > > > > > > > > > Best regards, Vitalii Tymchyshyn. > > > > > > > > > > 2012/4/13 Camille Fournier <[EMAIL PROTECTED]> > > > > > > > > > >> Hi everyone, > > > > >> > > > > >> I'm trying to evaluate a patch that Jeremy Stribling has > submitted, > > > and > > > > I'd > > > > >> like some feedback from the user base on it. > > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442> > > > >> > > > > >> The current behavior of ZK when we get an uncaught exception is to > > log > > > > it > > > > >> and try to move on. This is arguably not the right thing to do, > and > > > will > > > > >> possibly cause ZK to limp along with a bad VM (say, in an OOM > state) > > > for > > > > >> longer than it should. > > > > >> The patch proposes that when we get an instance of > java.lang.Error, > > we
+
Scott Fines 2012-04-16, 17:01
John Sirois 2012-04-16, 18:09
On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <[EMAIL PROTECTED]> wrote: > Fair enough, if you think it is an anti-pattern then perhaps you can > suggest an alternative that allows one to write tests that are (in > descending order of priority): > > 1. Quick and easy to setup and teardown > 2. Repeatably testable and debuggable in an IDE without having to resort to > the external build tool > 3. Testable in parallel, i.e. having more than one build running on a CI > server, so need some way to avoid port clashes > 4. Cross-platform - i.e. run the exact same build sequence on multiple > OSes. > We also embed for testing for all the reasons above with good success. In fact - I like the idea of the server product directly supporting alternate mains - although this may be more burden on zk devs initially. This would allow an org to write their own main that plugs in an uncaught handler and does whatever is appropriate in their deploy environment. > > Ishaaq > > On 16 April 2012 22:55, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > I believe that this change is inspired by someone that runs zk embedded. > > Personally I'm not moved by the testing argument, embedding the server > for > > testing is a bit of an anti pattern in my mind. > > > > From my phone > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <[EMAIL PROTECTED]> wrote: > > > > > I'd go so far as to say that even the server-code should avoid > > System.exit. > > > Just because it is "meant" to be a standalone system doesn't mean that > > code > > > that makes it impossible to embed it should be encouraged. > > > > > > For e.g, we embed a local version of ZK to be used inside our unit > tests. > > > This makes it much easier for us to control ZK to coincide with test > > > expectations as well as making for much faster build times. It would > be a > > > shame if the embedded ZK started killing the JVM. > > > > > > Ishaaq > > > > > > On 16 April 2012 04:28, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > > > This is a good point. > > > > I think this change should be fine for the server portion of the > code, > > > > since it's designed to be run as a standalone system. But for the > > > > client connection to also call system.exit on such an error is > > > > overreaching for all the reasons listed below. > > > > > > > > C > > > > > > > > 2012/4/15 Віталій Тимчишин <[EMAIL PROTECTED]>: > > > > > I really would not like for any library to perform a System.exit > > call. > > > > This > > > > > would make huge program exit out of sudden (think about j2ee, you > may > > > be > > > > > bitten by security manager). Note that there are more or less safe > > > > errors, > > > > > like StackOverflowError. > > > > > Also System.exit make testing nightmare. E.g. maven2 silently skips > > any > > > > > tests after the one that calls System.exit. And everything's green. > > > > > As for me good options are: > > > > > 1) Call user-provided uncaught exception handler. Use the one from > > the > > > > > thread that created the connection if one is not specified > explicity. > > > > > 1) Stop everything, notifying user with a global watcher. If it's > > > > possible, > > > > > clean any static state (e.g. restart threads) and allow to restart > > > > > connection. > > > > > In any case, call user code. Good system already know how to react > > (it > > > > may > > > > > want to send email to admin), allow it to perform well. > > > > > > > > > > Best regards, Vitalii Tymchyshyn. > > > > > > > > > > 2012/4/13 Camille Fournier <[EMAIL PROTECTED]> > > > > > > > > > >> Hi everyone, > > > > >> > > > > >> I'm trying to evaluate a patch that Jeremy Stribling has > submitted, > > > and > > > > I'd > > > > >> like some feedback from the user base on it. > > > > >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442> > > > >> > > > > >> The current behavior of ZK when we get an uncaught exception is to > > log > > > > it > > > > >> and try to move on. This is arguably not the right thing to do,
+
John Sirois 2012-04-16, 18:09
Scott Fines 2012-04-16, 18:27
Not to start a flame war here..
Mocks are good for a lot of purposes, but I can think of a few cases where testing failure states is a lot easier if you are actively contacting a ZK server (like testing a failed client connection's consequences). To that end, it seems like a good idea to support this kind of testing. So far, the easiest way seems to be embedding a ZK Server.
On the other hand, it doesn't seem likely that an active, production ZK cluster is going to want any kind of unusual behavior. It's almost always better to have some kind of behavior that is very well defined and predictable happen, even if that behavior is catastrophic. In this case, having a production server shut itself off whenever it doesn't know how to recover is very predictable, and can be designed around. Leaving that cluster in an awkward state that may or may not behave predictably is much harder to work with.
I personally can't see much else that you'd want to do in this scenario that falls far enough outside these two patterns that it's worth supporting custom handlers. It seems like the support time would be better spent developing and planning around these two states.
Scott
On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <[EMAIL PROTECTED]> wrote:
> On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <[EMAIL PROTECTED]> wrote: > > > Fair enough, if you think it is an anti-pattern then perhaps you can > > suggest an alternative that allows one to write tests that are (in > > descending order of priority): > > > > 1. Quick and easy to setup and teardown > > 2. Repeatably testable and debuggable in an IDE without having to resort > to > > the external build tool > > 3. Testable in parallel, i.e. having more than one build running on a CI > > server, so need some way to avoid port clashes > > 4. Cross-platform - i.e. run the exact same build sequence on multiple > > OSes. > > > > We also embed for testing for all the reasons above with good success. In > fact - I like the idea of the server product directly supporting alternate > mains - although this may be more burden on zk devs initially. This would > allow an org to write their own main that plugs in an uncaught handler and > does whatever is appropriate in their deploy environment. > > > > > > Ishaaq > > > > On 16 April 2012 22:55, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > I believe that this change is inspired by someone that runs zk > embedded. > > > Personally I'm not moved by the testing argument, embedding the server > > for > > > testing is a bit of an anti pattern in my mind. > > > > > > From my phone > > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <[EMAIL PROTECTED]> wrote: > > > > > > > I'd go so far as to say that even the server-code should avoid > > > System.exit. > > > > Just because it is "meant" to be a standalone system doesn't mean > that > > > code > > > > that makes it impossible to embed it should be encouraged. > > > > > > > > For e.g, we embed a local version of ZK to be used inside our unit > > tests. > > > > This makes it much easier for us to control ZK to coincide with test > > > > expectations as well as making for much faster build times. It would > > be a > > > > shame if the embedded ZK started killing the JVM. > > > > > > > > Ishaaq > > > > > > > > On 16 April 2012 04:28, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > > > > > This is a good point. > > > > > I think this change should be fine for the server portion of the > > code, > > > > > since it's designed to be run as a standalone system. But for the > > > > > client connection to also call system.exit on such an error is > > > > > overreaching for all the reasons listed below. > > > > > > > > > > C > > > > > > > > > > 2012/4/15 Віталій Тимчишин <[EMAIL PROTECTED]>: > > > > > > I really would not like for any library to perform a System.exit > > > call. > > > > > This > > > > > > would make huge program exit out of sudden (think about j2ee, you > > may > > > > be > > > > > > bitten by security manager). Note that there are more or less
+
Scott Fines 2012-04-16, 18:27
Camille Fournier 2012-04-16, 18:40
I've done extensive ZK client testing mostly using mocks with great success. I agree that there's cases when embedding a ZK is probably the easiest way to go, but it is actually much easier to simulate most failure cases in ZK via mocks than running an embedded server. It does require more up-front developer thought to set up the tests, but that thought pays dividends in that you actually are then forced to understand the possible failures and simulate them. It seems ultimately like a flag is the right way to go here, so I'll leave that feedback on the ticket.
C
2012/4/16 Scott Fines <[EMAIL PROTECTED]>
> Not to start a flame war here.. > > Mocks are good for a lot of purposes, but I can think of a few cases where > testing failure states is a lot easier if you are actively contacting a ZK > server (like testing a failed client connection's consequences). To that > end, it seems like a good idea to support this kind of testing. So far, the > easiest way seems to be embedding a ZK Server. > > On the other hand, it doesn't seem likely that an active, production ZK > cluster is going to want any kind of unusual behavior. It's almost always > better to have some kind of behavior that is very well defined and > predictable happen, even if that behavior is catastrophic. In this case, > having a production server shut itself off whenever it doesn't know how to > recover is very predictable, and can be designed around. Leaving that > cluster in an awkward state that may or may not behave predictably is much > harder to work with. > > I personally can't see much else that you'd want to do in this scenario > that falls far enough outside these two patterns that it's worth supporting > custom handlers. It seems like the support time would be better spent > developing and planning around these two states. > > Scott > > > > On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <[EMAIL PROTECTED]> > wrote: > > > On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <[EMAIL PROTECTED]> > wrote: > > > > > Fair enough, if you think it is an anti-pattern then perhaps you can > > > suggest an alternative that allows one to write tests that are (in > > > descending order of priority): > > > > > > 1. Quick and easy to setup and teardown > > > 2. Repeatably testable and debuggable in an IDE without having to > resort > > to > > > the external build tool > > > 3. Testable in parallel, i.e. having more than one build running on a > CI > > > server, so need some way to avoid port clashes > > > 4. Cross-platform - i.e. run the exact same build sequence on multiple > > > OSes. > > > > > > > We also embed for testing for all the reasons above with good success. > In > > fact - I like the idea of the server product directly supporting > alternate > > mains - although this may be more burden on zk devs initially. This > would > > allow an org to write their own main that plugs in an uncaught handler > and > > does whatever is appropriate in their deploy environment. > > > > > > > > > > Ishaaq > > > > > > On 16 April 2012 22:55, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > > > I believe that this change is inspired by someone that runs zk > > embedded. > > > > Personally I'm not moved by the testing argument, embedding the > server > > > for > > > > testing is a bit of an anti pattern in my mind. > > > > > > > > From my phone > > > > On Apr 15, 2012 11:20 PM, "Ishaaq Chandy" <[EMAIL PROTECTED]> wrote: > > > > > > > > > I'd go so far as to say that even the server-code should avoid > > > > System.exit. > > > > > Just because it is "meant" to be a standalone system doesn't mean > > that > > > > code > > > > > that makes it impossible to embed it should be encouraged. > > > > > > > > > > For e.g, we embed a local version of ZK to be used inside our unit > > > tests. > > > > > This makes it much easier for us to control ZK to coincide with > test > > > > > expectations as well as making for much faster build times. It > would > > > be a > > > > > shame if the embedded ZK started killing the JVM.
+
Camille Fournier 2012-04-16, 18:40
Ted Dunning 2012-04-16, 19:53
It would be lovely to have a ZK test jar that facilitates mock testing like this. I have done as Camille has, but the result is that we have probably replicated work without resulting in a usable framework.
Shame on me at least.
The mockup structure for tests *inside* ZK is far better than the structure available outside of ZK. Unfortunately, every time I have taken a run at this, it has exceeded my fun-time budget without significant publishable results.
2012/4/16 Camille Fournier <[EMAIL PROTECTED]>
> I've done extensive ZK client testing mostly using mocks with great > success. I agree that there's cases when embedding a ZK is probably the > easiest way to go, but it is actually much easier to simulate most failure > cases in ZK via mocks than running an embedded server. It does require more > up-front developer thought to set up the tests, but that thought pays > dividends in that you actually are then forced to understand the possible > failures and simulate them. > It seems ultimately like a flag is the right way to go here, so I'll leave > that feedback on the ticket. > > C > > 2012/4/16 Scott Fines <[EMAIL PROTECTED]> > > > Not to start a flame war here.. > > > > Mocks are good for a lot of purposes, but I can think of a few cases > where > > testing failure states is a lot easier if you are actively contacting a > ZK > > server (like testing a failed client connection's consequences). To that > > end, it seems like a good idea to support this kind of testing. So far, > the > > easiest way seems to be embedding a ZK Server. > > > > On the other hand, it doesn't seem likely that an active, production ZK > > cluster is going to want any kind of unusual behavior. It's almost always > > better to have some kind of behavior that is very well defined and > > predictable happen, even if that behavior is catastrophic. In this case, > > having a production server shut itself off whenever it doesn't know how > to > > recover is very predictable, and can be designed around. Leaving that > > cluster in an awkward state that may or may not behave predictably is > much > > harder to work with. > > > > I personally can't see much else that you'd want to do in this scenario > > that falls far enough outside these two patterns that it's worth > supporting > > custom handlers. It seems like the support time would be better spent > > developing and planning around these two states. > > > > Scott > > > > > > > > On Mon, Apr 16, 2012 at 1:09 PM, John Sirois <[EMAIL PROTECTED]> > > wrote: > > > > > On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy <[EMAIL PROTECTED]> > > wrote: > > > > > > > Fair enough, if you think it is an anti-pattern then perhaps you can > > > > suggest an alternative that allows one to write tests that are (in > > > > descending order of priority): > > > > > > > > 1. Quick and easy to setup and teardown > > > > 2. Repeatably testable and debuggable in an IDE without having to > > resort > > > to > > > > the external build tool > > > > 3. Testable in parallel, i.e. having more than one build running on a > > CI > > > > server, so need some way to avoid port clashes > > > > 4. Cross-platform - i.e. run the exact same build sequence on > multiple > > > > OSes. > > > > > > > > > > We also embed for testing for all the reasons above with good success. > > In > > > fact - I like the idea of the server product directly supporting > > alternate > > > mains - although this may be more burden on zk devs initially. This > > would > > > allow an org to write their own main that plugs in an uncaught handler > > and > > > does whatever is appropriate in their deploy environment. > > > > > > > > > > > > > > Ishaaq > > > > > > > > On 16 April 2012 22:55, Camille Fournier <[EMAIL PROTECTED]> wrote: > > > > > > > > > I believe that this change is inspired by someone that runs zk > > > embedded. > > > > > Personally I'm not moved by the testing argument, embedding the > > server > > > > for > > > > > testing is a bit of an anti pattern in my mind.
+
Ted Dunning 2012-04-16, 19:53
Jeremy Stribling 2012-04-16, 18:38
I would be very happy with a solution that allows a custom uncaught exception handler -- that's how we are working around this issue now, by installing a new default uncaught exception handler after we're sure NIOServerCnxnFactory has installed its own. I do think by default ZK should shut down predictably if it hits an error (with the possible exception of ThreadDeath), but turning it off by default until a major release seems reasonable if we allow a custom handler.
I don't know that I'm the right person to implement it though, not being super strong in Java or in the design of user inputs into ZK, so if anyone wants to take a crack at this, feel free.
On 04/16/2012 11:27 AM, Scott Fines wrote: > Not to start a flame war here.. > > Mocks are good for a lot of purposes, but I can think of a few cases where > testing failure states is a lot easier if you are actively contacting a ZK > server (like testing a failed client connection's consequences). To that > end, it seems like a good idea to support this kind of testing. So far, the > easiest way seems to be embedding a ZK Server. > > On the other hand, it doesn't seem likely that an active, production ZK > cluster is going to want any kind of unusual behavior. It's almost always > better to have some kind of behavior that is very well defined and > predictable happen, even if that behavior is catastrophic. In this case, > having a production server shut itself off whenever it doesn't know how to > recover is very predictable, and can be designed around. Leaving that > cluster in an awkward state that may or may not behave predictably is much > harder to work with. > > I personally can't see much else that you'd want to do in this scenario > that falls far enough outside these two patterns that it's worth supporting > custom handlers. It seems like the support time would be better spent > developing and planning around these two states. > > Scott > > > > On Mon, Apr 16, 2012 at 1:09 PM, John Sirois<[EMAIL PROTECTED]> wrote: > >> On Mon, Apr 16, 2012 at 10:52 AM, Ishaaq Chandy<[EMAIL PROTECTED]> wrote: >> >>> Fair enough, if you think it is an anti-pattern then perhaps you can >>> suggest an alternative that allows one to write tests that are (in >>> descending order of priority): >>> >>> 1. Quick and easy to setup and teardown >>> 2. Repeatably testable and debuggable in an IDE without having to resort >> to >>> the external build tool >>> 3. Testable in parallel, i.e. having more than one build running on a CI >>> server, so need some way to avoid port clashes >>> 4. Cross-platform - i.e. run the exact same build sequence on multiple >>> OSes. >>> >> We also embed for testing for all the reasons above with good success. In >> fact - I like the idea of the server product directly supporting alternate >> mains - although this may be more burden on zk devs initially. This would >> allow an org to write their own main that plugs in an uncaught handler and >> does whatever is appropriate in their deploy environment. >> >> >>> Ishaaq >>> >>> On 16 April 2012 22:55, Camille Fournier<[EMAIL PROTECTED]> wrote: >>> >>>> I believe that this change is inspired by someone that runs zk >> embedded. >>>> Personally I'm not moved by the testing argument, embedding the server >>> for >>>> testing is a bit of an anti pattern in my mind. >>>> >>>> From my phone >>>> On Apr 15, 2012 11:20 PM, "Ishaaq Chandy"<[EMAIL PROTECTED]> wrote: >>>> >>>>> I'd go so far as to say that even the server-code should avoid >>>> System.exit. >>>>> Just because it is "meant" to be a standalone system doesn't mean >> that >>>> code >>>>> that makes it impossible to embed it should be encouraged. >>>>> >>>>> For e.g, we embed a local version of ZK to be used inside our unit >>> tests. >>>>> This makes it much easier for us to control ZK to coincide with test >>>>> expectations as well as making for much faster build times. It would >>> be a >>>>> shame if the embedded ZK started killing the JVM.
+
Jeremy Stribling 2012-04-16, 18:38
Shelley, Ryan 2012-04-13, 18:22
Just my 2 centsŠ is the error code 1 the correct error code to return to the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1) may be called. It may make sense to either re-use that error code, or use a different one (if 1 is already used elsewhere for a different type of error, like "Invalid arguments" during start-up, for example). If the error isn't an OOME, is there any clean-up ZK needs to do to maybe inform a cluster it's going down abruptly (maybe to gracefully begin a leader re-election if necessary, for example)? I'm +1 to fail-fast behavior. Ryan On 4/13/12 8:15 AM, "Scott Fines" <[EMAIL PROTECTED]> wrote: >On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM >for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd >args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't >remember off the top of my head). Will these triggers still be fired, or >will the catch-all prevent them? > >I'm still +1 for the change no matter what, but it's probably a good idea >to mention that in the docs if they don't work. > >Scott > >On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier ><[EMAIL PROTECTED]>wrote: > >> Hi everyone, >> >> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and >>I'd >> like some feedback from the user base on it. >> https://issues.apache.org/jira/browse/ZOOKEEPER-1442>> >> The current behavior of ZK when we get an uncaught exception is to log >>it >> and try to move on. This is arguably not the right thing to do, and will >> possibly cause ZK to limp along with a bad VM (say, in an OOM state) for >> longer than it should. >> The patch proposes that when we get an instance of java.lang.Error, we >> should do a system.exit to fast-fail the process. With the possible >> exception of ThreadDeath (which may or may not be an unrecoverable >>system >> state depending on the thread), I think this makes sense, but I would >>like >> to hear from others if they have an opinion. I think it's better to kill >> the process and let your monitoring services detect process death (and >>thus >> restart) than possibly linger unresponsive for a while, are there >>scenarios >> that we're missing where this error can occur and you wouldn't want the >> process killed? >> >> Thanks for your feedback, >> >> Camille >>
+
Shelley, Ryan 2012-04-13, 18:22
Shelley, Ryan 2012-04-13, 18:37
And to go just a bit further with this, would this be considered a non-backward compatible change to the core expectations of the system (even if it's patching a "bug")? It changes the "expected" behavior of the application from an undefined possible zombie state to fail-fast. Theoretically, that might constitute a version bump ( http://semver.org), unless you parameterize the behavior so that a user, who encounters this behavior outside of their existing expectations, could turn off the fail-fast behavior by setting an appropriate flag. I'm not an expert in ZK whatsoever, but maybe there's some runtime operation that occurs and generates an unchecked exception and the system is fine with that, but it's only really experienced at runtime in exceptional cases. If the system fails-fast, without a way to disable fail-fast, you may have to release another patch promptly (instead of simply saying "disable fail-fast for now, we'll fix the larger issue in the next release"). Maybe that's excessive, but just putting it out there anyway. On 4/13/12 11:22 AM, "Shelley, Ryan" <[EMAIL PROTECTED]> wrote: >Just my 2 centsŠ is the error code 1 the correct error code to return to >the OS? I'm just curious if anywhere else in ZooKeeper a System.exit(1) >may be called. It may make sense to either re-use that error code, or use >a different one (if 1 is already used elsewhere for a different type of >error, like "Invalid arguments" during start-up, for example). > >If the error isn't an OOME, is there any clean-up ZK needs to do to maybe >inform a cluster it's going down abruptly (maybe to gracefully begin a >leader re-election if necessary, for example)? > >I'm +1 to fail-fast behavior. > >Ryan > >On 4/13/12 8:15 AM, "Scott Fines" <[EMAIL PROTECTED]> wrote: > >>On some JVMs (the HotSpot for sure, but maybe others too?) there's a JVM >>for performing actions on OutOfMemoryErrors (-XX:OnOutOfMemoryError="<cmd >>args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I can't >>remember off the top of my head). Will these triggers still be fired, or >>will the catch-all prevent them? >> >>I'm still +1 for the change no matter what, but it's probably a good idea >>to mention that in the docs if they don't work. >> >>Scott >> >>On Fri, Apr 13, 2012 at 10:09 AM, Camille Fournier >><[EMAIL PROTECTED]>wrote: >> >>> Hi everyone, >>> >>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, and >>>I'd >>> like some feedback from the user base on it. >>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442>>> >>> The current behavior of ZK when we get an uncaught exception is to log >>>it >>> and try to move on. This is arguably not the right thing to do, and >>>will >>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) >>>for >>> longer than it should. >>> The patch proposes that when we get an instance of java.lang.Error, we >>> should do a system.exit to fast-fail the process. With the possible >>> exception of ThreadDeath (which may or may not be an unrecoverable >>>system >>> state depending on the thread), I think this makes sense, but I would >>>like >>> to hear from others if they have an opinion. I think it's better to >>>kill >>> the process and let your monitoring services detect process death (and >>>thus >>> restart) than possibly linger unresponsive for a while, are there >>>scenarios >>> that we're missing where this error can occur and you wouldn't want the >>> process killed? >>> >>> Thanks for your feedback, >>> >>> Camille >>> >
+
Shelley, Ryan 2012-04-13, 18:37
Shelley, Ryan 2012-04-13, 22:00
System.exit(int) is developer specific. Traditionally, it should be non-zero for abnormal operation (so really it should be System.exit(-1)). Maybe it could just be parameterized entirely. Apply the UncaughtExceptionHandler only if the option is enabled. --failfast=true/false (defaults to false, until a future major release) Ryan On 4/13/12 2:38 PM, "Michi Mutsuzaki" <[EMAIL PROTECTED]> wrote: >By the way, grepping the source code, I see we use various exit codes. >Does anybody know what they all mean? > >--Michi > >On Fri, Apr 13, 2012 at 2:31 PM, Michi Mutsuzaki <[EMAIL PROTECTED]> >wrote: >> Ah I see. I guess log and System.exit(1) is the best we can do then. >> >> --Michi >> >> On Fri, Apr 13, 2012 at 2:25 PM, Jeremy Stribling <[EMAIL PROTECTED]> >>wrote: >>> On 04/13/2012 01:19 PM, Michi Mutsuzaki wrote: >>>> >>>> I agree with both Scott's and Ryan's points. I think it makes to: >>>> >>>> 1. Make this behavior configurable (with default being "turned off") >>>>to >>>> preserve backward compatibility. >>>> 2. Re-throw the exception instead of exiting with System.exit(1) so >>>>that >>>> users can use flags like -XX:+HeapDumpOnOutOfMemoryError. >>> >>> >>> I don't think re-throwing exceptions from an uncaught exception >>>handler is >>> an option: >>> >>> http://docs.oracle.com/javase/6/docs/api/java/lang/Thread.UncaughtExcept>>>ionHandler.html#uncaughtException%28java.lang.Thread,%20java.lang.Throwa >>>ble%29 >>> >>> " Any exception thrown by this method will be ignored by the Java >>>Virtual >>> Machine." >>> >>> >>>> --Michi >>>> ________________________________________ >>>> From: Scott Fines [[EMAIL PROTECTED]] >>>> Sent: Friday, April 13, 2012 8:15 AM >>>> To: [EMAIL PROTECTED] >>>> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] >>>> Subject: Re: Input on a change >>>> >>>> On some JVMs (the HotSpot for sure, but maybe others too?) there's a >>>>JVM >>>> for performing actions on OutOfMemoryErrors >>>>(-XX:OnOutOfMemoryError="<cmd >>>> args>, -XX:+HeapDumpOnOutOfMemoryError and maybe some others that I >>>>can't >>>> remember off the top of my head). Will these triggers still be fired, >>>>or >>>> will the catch-all prevent them? >>>> >>>> I'm still +1 for the change no matter what, but it's probably a good >>>>idea >>>> to mention that in the docs if they don't work. >>>> >>>> Scott >>>> >>>> On Fri, Apr 13, 2012 at 10:09 AM, Camille >>>> Fournier<[EMAIL PROTECTED]>wrote: >>>> >>>>> Hi everyone, >>>>> >>>>> I'm trying to evaluate a patch that Jeremy Stribling has submitted, >>>>>and >>>>> I'd >>>>> like some feedback from the user base on it. >>>>> https://issues.apache.org/jira/browse/ZOOKEEPER-1442>>>>> >>>>> The current behavior of ZK when we get an uncaught exception is to >>>>>log it >>>>> and try to move on. This is arguably not the right thing to do, and >>>>>will >>>>> possibly cause ZK to limp along with a bad VM (say, in an OOM state) >>>>>for >>>>> longer than it should. >>>>> The patch proposes that when we get an instance of java.lang.Error, >>>>>we >>>>> should do a system.exit to fast-fail the process. With the possible >>>>> exception of ThreadDeath (which may or may not be an unrecoverable >>>>>system >>>>> state depending on the thread), I think this makes sense, but I would >>>>> like >>>>> to hear from others if they have an opinion. I think it's better to >>>>>kill >>>>> the process and let your monitoring services detect process death >>>>>(and >>>>> thus >>>>> restart) than possibly linger unresponsive for a while, are there >>>>> scenarios >>>>> that we're missing where this error can occur and you wouldn't want >>>>>the >>>>> process killed? >>>>> >>>>> Thanks for your feedback, >>>>> >>>>> Camille >>>>> >>>
+
Shelley, Ryan 2012-04-13, 22:00
+
Jordan Zimmerman 2012-04-13, 23:18
|
|