Systems I've used that include automatic restart usually limit themselves
to 3-4 restarts in a row before giving up. It's nice to have a timeout on
that counter so you retain the auto-restart capability if you need to
suspend a few days from now.
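
As a rough sketch of that counter-with-timeout idea (the class name, the
default of 3 restarts, and the one-hour reset window are all made up for
illustration, not taken from any of the systems mentioned):

```python
import time


class RestartBudget:
    """Permit a bounded number of consecutive restarts.

    The counter is cleared once `reset_after` seconds have passed since
    the last permitted restart, so an old burst of failures does not
    disable auto-restart forever. All defaults here are hypothetical.
    """

    def __init__(self, max_restarts=3, reset_after=3600.0, clock=time.monotonic):
        self.max_restarts = max_restarts
        self.reset_after = reset_after
        self.clock = clock          # injectable so the logic can be tested
        self.count = 0
        self.last_restart = None    # time of the last permitted restart

    def allow_restart(self):
        now = self.clock()
        # Forget earlier restarts if the process then stayed quiet long enough.
        if self.last_restart is not None and now - self.last_restart >= self.reset_after:
            self.count = 0
        if self.count >= self.max_restarts:
            return False            # too many restarts in a row: give up
        self.count += 1
        self.last_restart = now
        return True
```

A supervisor would call `allow_restart()` each time the process dies and
stop relaunching as soon as it returns False.
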
I've also worked on a system where process restarts were the way we handled
failures. ZooKeeper state can be tricky to recover if you've been down for
long enough for your session to expire. I found it easier to just kill the
process and go through the full "boot-up" logic. In that system, we had
the shell scripts that launched the JVMs handle the restart, with the
restart policy dictated by exit code.
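
To make the exit-code idea concrete, here is a minimal sketch in Python
rather than shell. The exit code 75 and the names RESTARTABLE_CODES and
supervise are assumptions for illustration only; the actual scripts and
codes we used are not shown here.

```python
import subprocess

# Hypothetical convention: the child exits with one of these codes when it
# wants to be relaunched (e.g. after losing its ZooKeeper session). Any
# other code, including 0, means "stay down".
RESTARTABLE_CODES = {75}


def supervise(cmd, max_restarts=3, run=subprocess.call):
    """Run cmd, relaunching it while its exit code asks for a restart.

    Returns the exit code of the final run. `run` is injectable so the
    policy can be exercised without spawning real processes.
    """
    restarts = 0
    while True:
        code = run(cmd)
        if code not in RESTARTABLE_CODES:
            return code             # clean exit or a death we should not undo
        if restarts >= max_restarts:
            return code             # restart limit reached: give up
        restarts += 1
```

The point of the design is that the process itself decides which deaths
are worth recovering from, and the launcher only enforces the limit.
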
On Wed, Feb 15, 2012 at 11:16 AM, John Vines <[EMAIL PROTECTED]> wrote:
> There are too many cases where a node legitimately died and we do not want
> it constantly coming back and bogging things down. How do you design it to
> restart the accidental deaths but not the deserves-it deaths?
> On Feb 15, 2012 11:11 AM, "Adam Fuchs" <[EMAIL PROTECTED]> wrote:
>> This isn't really just a laptop problem. We also see hiccups in clusters
>> (admins accidentally the whole network, etc.) that we would want to
>> automatically recover from. I think having self-restarting processes could
>> be generally useful.
>> I think that an option of not using ZooKeeper timeouts might lead to
>> abuse, and could be very bad for stability under rare failure modes. We
>> make a lot of assumptions throughout the code about these timeouts, and we
>> would have to reconsider a large part of that model.
>> On Wed, Feb 15, 2012 at 10:56 AM, Billie J Rinaldi <
>> [EMAIL PROTECTED]> wrote:
>>> On Wednesday, February 15, 2012 10:38:41 AM, "Aaron Cordova" <
>>> [EMAIL PROTECTED]> wrote:
>>> > Such an option would have to be very conspicuous so that users don't
>>> > accidentally enable it and then wonder why bad tablet servers aren't
>>> > removed automatically from the cluster.
>>> We could call it laptop.mode.