The crash is actually just a call to shutdown. We think this is the right
thing to do, though I agree it is unintuitive. Here is why. When you get an
out-of-space error, it is likely that the operating system did a partial
write, leaving you with a corrupt log. Furthermore, space may free up
later, at which point more writes to the log could succeed. You wouldn't
even know there was a problem, but all your consumers would hit the
corrupt data and choke.
By "crashing" the node we ensure that recovery is run on the log to bring
it into a consistent state.
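
To make the recovery idea concrete, here is a minimal sketch of what
"bringing the log into a consistent state" can look like. It is not the
actual broker code or on-disk format. It assumes a hypothetical record
layout (4-byte length, 4-byte CRC32, then the payload), and the names
encode_record, valid_prefix_length and recover are made up for
illustration. Recovery scans from the start and cuts the log at the first
record that is incomplete or fails its checksum.

```python
import struct
import zlib

# Hypothetical framing, not the real on-disk format:
# 4-byte big-endian payload length, 4-byte CRC32 of the payload, payload.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Frame one record as length + checksum + payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def valid_prefix_length(log: bytes) -> int:
    """Return how many leading bytes form complete, checksum-valid records."""
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        end = pos + HEADER.size + length
        if end > len(log):
            break  # record cut short, e.g. by a partial write
        if zlib.crc32(log[pos + HEADER.size:end]) != crc:
            break  # payload does not match its checksum
        pos = end
    return pos


def recover(log: bytes) -> bytes:
    """Drop everything after the last valid record."""
    return log[:valid_prefix_length(log)]


if __name__ == "__main__":
    good = encode_record(b"first") + encode_record(b"second")
    # Simulate a partial write: the last record loses its final 3 bytes.
    torn = good + encode_record(b"third")[:-3]
    assert recover(torn) == good
```

Without this step, a later successful append would land after the torn
record, and a reader walking the log would misparse everything from that
point on.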
Theoretically we could leave the node up accepting reads but rejecting
writes while attempting to recover the log. But this is very complex and
has a bunch of problems. Likely, if you are out of space, you are just
going to keep getting writes, running out of space again, running
recovery, and so on. This kind of crazy loop is much worse than just
needing to bring the node back up.
Alternatively, we could leave the node up but go into some kind of
write-rejecting mode forever. But this would still require that you restart
the node, and we would have to implement that write-rejecting mode.
On Wed, Aug 14, 2013 at 9:52 AM, Bryan Baugher <[EMAIL PROTECTED]> wrote: