Accumulo, mail # user - Suspension


Re: Suspension
Aaron Cordova 2012-02-15, 16:55
I don't know if a general process-starting service belongs in the Accumulo project, but it is cumbersome to run a large distributed service without one. Are there existing tools that come close? I'm sure there are tools that monitor machines for missing processes and can ssh in and restart them.

There are systems designed according to principles such as "CrashOnlySoftware" and Erlang's "LetItCrash", in which processes are stopped by crashing and startup always involves recovery. That is kind of elegant, since you don't have to design separate non-crash stop and non-recovery start sequences. However, I don't think Accumulo is designed that way right now; making it so might be a lot of work, and it might not be a good idea anyway.

I also anticipate that making all processes, including ZooKeeper, able to continue operating in the presence of large gaps of time might be a lot of work and might destabilize monitoring and recovery mechanisms already in place. It would only be worth doing if it became clear that it could be done cleanly, and while keeping the standard, non-laptop, and non-VM/non-EC2 mode of operation intact. I value stability over nice-to-have but outside-the-core-use-case features at this point.
On Feb 15, 2012, at 11:38 AM, Joey Echeverria wrote:

> Systems I've used that include automatic restart usually have a limit of restarting 3-4 times in a row before giving up. It's nice if you can have a timeout on that counter, so you retain the auto-restart capability if you need to suspend a few days from now.
>
> I've also worked on a system where process restarts were the way we handled failures. ZooKeeper state can be tricky to recover if you've been down long enough for your session to expire. I found it easier to just kill the process and go through the full "boot-up" logic. In that system, the shell scripts that launched the JVMs handled the restart, with the restart policy dictated by the exit code.
>
> -Joey
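
[Editor's illustration] The restart policy described above (a cap on consecutive restarts, a timeout that resets the counter, and decisions driven by the exit code) could be sketched roughly as follows. This is a minimal sketch, not code from Accumulo or from the system Joey mentions: the class name, the parameter names, the default limits, and the choice of which exit codes mean "do not restart" are all assumptions made for illustration.

```python
import subprocess
import time
from dataclasses import dataclass, field
from typing import Callable, FrozenSet, List, Optional


@dataclass
class RestartPolicy:
    """Hypothetical policy: restart up to max_restarts times in a row,
    and forget past restarts after a quiet period of reset_after_s."""
    max_restarts: int = 3                      # assumed limit ("3-4 times in a row")
    reset_after_s: float = 3600.0              # assumed timeout on the counter
    # Assumed convention: these exit codes mean the process should stay down.
    no_restart_codes: FrozenSet[int] = frozenset({0})
    clock: Callable[[], float] = time.monotonic
    _count: int = field(default=0, init=False)
    _last_exit: Optional[float] = field(default=None, init=False)

    def should_restart(self, exit_code: int) -> bool:
        now = self.clock()
        # Reset the counter if the previous exit was long enough ago.
        if self._last_exit is not None and now - self._last_exit >= self.reset_after_s:
            self._count = 0
        self._last_exit = now
        if exit_code in self.no_restart_codes:
            return False                       # exit code says: stay down
        if self._count >= self.max_restarts:
            return False                       # too many restarts in a row: give up
        self._count += 1
        return True


def supervise(cmd: List[str], policy: RestartPolicy) -> int:
    """Run cmd, restarting it while the policy allows; return the last exit code."""
    while True:
        code = subprocess.call(cmd)
        if not policy.should_restart(code):
            return code
```

Because the counter is reset by elapsed time rather than by a manual step, a node that was suspended and resumed days later would still get its full restart budget, while a node that dies repeatedly in quick succession is left down.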
>
> On Wed, Feb 15, 2012 at 11:16 AM, John Vines <[EMAIL PROTECTED]> wrote:
> There are too many cases where a node legitimately died and we do not want it constantly coming back and bogging things down. How do you design it to restart after the accidental deaths but not the deserved ones?
>
> On Feb 15, 2012 11:11 AM, "Adam Fuchs" <[EMAIL PROTECTED]> wrote:
> This isn't really just a laptop problem. We also see hiccups in clusters (admins accidentally the whole network, etc.) that we would want to automatically recover from. I think having self-restarting processes could be generally useful.
>
> I think that an option of not using zookeeper timeouts might lead to abuse, and could be very bad for stability under rare failure modes. We make a lot of assumptions throughout the code about these timeouts, and we would have to reconsider a large part of that model.
>
> Adam
>
>
> On Wed, Feb 15, 2012 at 10:56 AM, Billie J Rinaldi <[EMAIL PROTECTED]> wrote:
> On Wednesday, February 15, 2012 10:38:41 AM, "Aaron Cordova" <[EMAIL PROTECTED]> wrote:
> > Such an option would have to be very conspicuous so that users don't
> > accidentally enable it and then wonder why bad tablet servers aren't
> > removed automatically from the cluster.
>
> We could call it laptop.mode.
>
> Billie
>
>
>
>
> --
> Joseph Echeverria
> Cloudera, Inc.
> 443.305.9434
>