|
Aaron Cordova
2012-02-15, 14:54
Adam Fuchs
2012-02-15, 15:26
John Vines
2012-02-15, 15:31
Aaron Cordova
2012-02-15, 15:38
Billie J Rinaldi
2012-02-15, 15:56
Adam Fuchs
2012-02-15, 16:10
John Vines
2012-02-15, 16:10
John Vines
2012-02-15, 16:16
Joey Echeverria
2012-02-15, 16:38
Aaron Cordova
2012-02-15, 16:55
Adam Fuchs
2012-02-15, 17:00
David Medinets
2012-02-15, 17:12
Aaron Cordova
2012-02-15, 17:20
John Vines
2012-02-15, 20:06
|
-
Suspension Aaron Cordova 2012-02-15, 14:54
EC2 as well as laptop users would be interested in making Accumulo 'suspendable'. The self-monitoring features end up killing off processes upon awakening. Perhaps this could be implemented by a simple switch that tells Accumulo not to worry about abandoning processes that don't report. The switch could be enabled before suspension and disabled afterward, or simply left enabled for stand-alone laptop users.
Does it make sense to make it possible to suspend a running Accumulo instance, or should this simply be discouraged and made well known?
-
Re: Suspension Adam Fuchs 2012-02-15, 15:26
I think this makes a lot of sense. I use Accumulo enough on a laptop to be annoyed at how often I have to run start-all.sh.

One way we could do this is to have a separate daemon process restart Accumulo processes anytime they go down. I think log recovery is almost as efficient as any other way of suspending memory to disk, and it doesn't add any extra complexity to the code base. The only other concern is having the daemon restart a process that should actually be down, and we would have to work out the model for that.

Adam
-
Re: Suspension John Vines 2012-02-15, 15:31
That sounds too hacky. Why not just have a config option for whether ZK timeouts are heeded?
-
Re: Suspension Aaron Cordova 2012-02-15, 15:38
Such an option would have to be very conspicuous so that users don't accidentally enable it and then wonder why bad tablet servers aren't removed automatically from the cluster.
It would also require some thought to make sure that large gaps in all processes' consciousnesses (5 s's in that word!) don't cause other undesirable effects.
-
Re: Suspension Billie J Rinaldi 2012-02-15, 15:56
On Wednesday, February 15, 2012 10:38:41 AM, "Aaron Cordova" <[EMAIL PROTECTED]> wrote:
> Such an option would have to be very conspicuous so that users don't
> accidentally enable it and then wonder why bad tablet servers aren't
> removed automatically from the cluster.

We could call it laptop.mode.

Billie
-
Re: Suspension Adam Fuchs 2012-02-15, 16:10
This isn't really just a laptop problem. We also see hiccups in clusters (admins accidentally the whole network, etc.) that we would want to automatically recover from. I think having self-restarting processes could be generally useful.

I think that an option of not using ZooKeeper timeouts might lead to abuse, and could be very bad for stability under rare failure modes. We make a lot of assumptions throughout the code about these timeouts, and we would have to reconsider a large part of that model.

Adam
-
Re: Suspension John Vines 2012-02-15, 16:10
On Feb 15, 2012 10:57 AM, "Billie J Rinaldi" <[EMAIL PROTECTED]> wrote:
> We could call it laptop.mode.

+1
-
Re: Suspension John Vines 2012-02-15, 16:16
There are too many cases where a node legitimately died and we do not want it constantly coming back and bogging things down. How do you design it to restart the accidental deaths but not the deserved ones?
-
Re: Suspension Joey Echeverria 2012-02-15, 16:38
Systems I've used that include automatic restart usually have a limit of restarting 3-4 times in a row before giving up. It's nice if you can have a timeout on that counter, so you retain the auto-restart capability if you need to suspend a few days from now.

I've also worked on a system where process restarts were the way we handled failures. ZooKeeper state can be tricky to recover if you've been down long enough for your session to expire. I found it easier to just kill the process and go through the full "boot-up" logic. In that system, the shell scripts that launched the JVMs handled the restart, with the restart policy dictated by exit code.

-Joey
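The restart policy described above (a cap on consecutive restarts, a timeout that resets the counter, and a decision driven by exit code) could be sketched roughly as follows. This is only an illustration, not anything from Accumulo: the class name `RestartPolicy`, the parameter names, and the set of "stay down" exit codes are all hypothetical.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical convention: these exit codes mean "the process should stay down".
NO_RESTART_EXIT_CODES = frozenset({0, 2})


@dataclass
class RestartPolicy:
    max_restarts: int = 3           # consecutive restarts allowed before giving up
    reset_after_s: float = 3600.0   # quiet period after which the counter resets
    # Wall-clock time is assumed here so that time spent suspended still counts.
    clock: Callable[[], float] = time.time
    _count: int = field(default=0, init=False)
    _last_restart: Optional[float] = field(default=None, init=False)

    def should_restart(self, exit_code: int) -> bool:
        """Decide whether a process that exited with exit_code gets relaunched."""
        if exit_code in NO_RESTART_EXIT_CODES:
            return False
        now = self.clock()
        # Forget old restarts once enough time has passed since the last one.
        if (self._last_restart is not None
                and now - self._last_restart >= self.reset_after_s):
            self._count = 0
        if self._count >= self.max_restarts:
            return False
        self._count += 1
        self._last_restart = now
        return True
```

A launcher script would call `should_restart(exit_code)` each time the child process exits and relaunch it only on `True`.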
-
Re: Suspension Aaron Cordova 2012-02-15, 16:55
I don't know if a general process-starting service belongs in the Accumulo project, but it is cumbersome to run a large distributed service without one. Are there existing tools that come close? I'm sure there are tools that monitor machines for missing processes and can ssh in and restart them.

There are systems designed according to principles such as "CrashOnlySoftware" and Erlang's "LetItCrash", in which processes are stopped by crashing and startup always involves recovery. That is kind of elegant, since you don't have to design separate non-crash stop and non-recovery start sequences. However, I think Accumulo is not quite designed that way right now; it might be a lot of work to make it that way, and it might not be a good idea anyway.

I also anticipate that making all processes, including ZooKeeper, able to continue operating in the presence of large gaps of time might be a lot of work and might destabilize the monitoring and recovery mechanisms already in place. It would only be worth doing if it became clear that it could be done cleanly while keeping the standard (non-laptop, non-VM/non-EC2) mode of operation intact. I value stability over nice-to-have features outside the core use case at this point.
-
Re: Suspension Adam Fuchs 2012-02-15, 17:00
I think we would start out by enumerating the cases in which processes die and we want them to stay dead, and then consider the repercussions of trying to restart them in those cases. What cases can you think of in this space? Here's my short list:

1. Logger dies due to running out of disk space. Restarting it should be safe because it checks this condition every time it starts?
2. A node is behaving "wonkily" and we choose to remove it from the cluster. In a manual override condition we can just kill the restarting daemon. That would take care of restarting, assuming we can log in on that node. If we can't log in, this could be accomplished through a decommission list in ZooKeeper that the restarter checks before trying to launch.
3. A tablet server or logger gets overburdened and can't keep up with its load. As long as we wait for the cluster to rebalance, this should lead to a better balanced cluster.

This is by no means a complete list, so please add to it.

Adam
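The decommission-list check from case 2 might look something like the sketch below. Nothing here is an existing Accumulo or ZooKeeper API: `should_launch` and `fetch_decommissioned` are hypothetical names, and the lookup is passed in as a plain callable so that the idea can be shown without a real ZooKeeper client. Refusing to launch when the list cannot be read is also an assumption; a real restarter might prefer to retry.

```python
from typing import Callable, Iterable


def should_launch(host: str,
                  fetch_decommissioned: Callable[[], Iterable[str]]) -> bool:
    """Return True only if the host is known not to be decommissioned.

    fetch_decommissioned stands in for reading the list from ZooKeeper.
    """
    try:
        decommissioned = set(fetch_decommissioned())
    except Exception:
        # Assumed policy: if the list is unreadable, do not restart anything.
        return False
    return host not in decommissioned
```

The restarter would call this immediately before each relaunch, so removing a node from the cluster only requires adding its name to the list.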
-
Re: Suspension David Medinets 2012-02-15, 17:12
It seems like the conversation has wandered away from the main point: marking a node as suspended instead of having a monitoring service discover that it is non-responsive. Would it be possible to issue a command-line 'suspend' command, and then a 'resume' command when the user is ready to have the node back in the cluster?
-
Re: Suspension Aaron Cordova 2012-02-15, 17:20
Yeah, we don't want to let designing a restart service distract us from the suspension discussion.

Issuing a 'suspend' command sounds like a third option.

So far we have:

1) run Accumulo in a mode that ignores long timeouts (perhaps enabled just before suspension)
2) let Accumulo die (no modification to Accumulo) and rely on a to-be-created restart service
3) issue a command to suspend processes before suspending the VM / OS

Perhaps the 'suspend' command just enables ignorance of timeouts, but if you're gonna issue a command, you might as well just issue the 'shutdown' command.

What's the start-up time like for large clusters nowadays?

Also, what is the effect of taking all tables offline?
-
Re: Suspension John Vines 2012-02-15, 20:06
Perhaps we want a suspend option which provides the ZK timeouts one large skew before it expects normal behavior again?

John
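One way to picture the "one large skew" idea is a liveness check that forgives a single oversized gap after a suspend has been announced, then goes back to enforcing the normal timeout. The sketch below is purely illustrative: `LivenessTracker` and its methods are hypothetical names, and it does not model real ZooKeeper sessions, whose expiry is decided by the ZooKeeper servers rather than by the client.

```python
import time
from typing import Callable


class LivenessTracker:
    """Tracks heartbeats from one process and tolerates one announced gap."""

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.time):
        self.timeout_s = timeout_s
        self.clock = clock
        self.last_seen = clock()
        self.skew_allowed = False

    def arm_suspend(self) -> None:
        # Called just before suspension: the next oversized gap is forgiven.
        self.skew_allowed = True

    def heartbeat(self) -> None:
        now = self.clock()
        # The allowance is used up by the first heartbeat that arrives late.
        if self.skew_allowed and now - self.last_seen > self.timeout_s:
            self.skew_allowed = False
        self.last_seen = now

    def is_expired(self) -> bool:
        if self.skew_allowed:
            return False
        return self.clock() - self.last_seen > self.timeout_s
```

A drawback of this shape is that a process which is armed and then dies without ever suspending is never declared expired, which is the kind of rare failure mode raised earlier in the thread.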