Bryan Baugher 2013-08-14, 16:52
Joel Koshy 2013-08-15, 00:06
The crash is actually just a call to shutdown. We think this is the right
thing to do, though I agree it is unintuitive. Here is why. When you get an
out-of-space error, it is likely that the operating system did a partial
write, leaving you with a corrupt log. Furthermore, it is possible that
space will free up, at which point more writes to the log could succeed, so
you wouldn't even know there was a problem, but all your consumers would hit
the corrupt data and choke.
By "crashing" the node we ensure that recovery is run on the log to bring
it into a consistent state.
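To make the recovery idea concrete, here is a minimal sketch of truncating a log back to its last intact record after a partial write. It is not Kafka's actual on-disk format or recovery code: the record layout (a length and CRC32 header followed by the payload) and the names encode_record, valid_prefix_length, and recover are assumptions made up for illustration.

```python
import struct
import zlib

# Hypothetical record layout: 4-byte payload length, 4-byte CRC32, payload.
HEADER = struct.Struct(">II")


def encode_record(payload: bytes) -> bytes:
    """Frame a payload with its length and checksum."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def valid_prefix_length(log: bytes) -> int:
    """Return the offset just past the last complete, checksum-valid record."""
    pos = 0
    while pos + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, pos)
        end = pos + HEADER.size + length
        if end > len(log):
            break  # record is cut short: a partial write
        if zlib.crc32(log[pos + HEADER.size:end]) != crc:
            break  # payload does not match its checksum
        pos = end
    return pos


def recover(log: bytes) -> bytes:
    """Drop everything after the last valid record."""
    return log[:valid_prefix_length(log)]
```

In this sketch a log whose final record was only partly written comes back from recover() without that torn tail, which is the consistent state a restart is meant to restore before any more writes are appended.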
Theoretically we could leave the node up accepting reads but rejecting
writes while attempting to recover the log. But there are a bunch of
problems with this, and it is very complex. Likely, if you are out of
space, you are just going to keep getting writes, running out of space
again, then running recovery, and so on. This kind of crazy loop is much
worse than just needing to bring the node back up.
Alternatively, we could leave the node up but go into some kind of
write-rejecting mode forever. But this would still require that you restart
the node, and we would have to implement that write-rejecting mode.
On Wed, Aug 14, 2013 at 9:52 AM, Bryan Baugher <[EMAIL PROTECTED]> wrote:
> This is more of a thought question than a problem that I need support for.
> I have been trying out Kafka 0.8.0-beta1 with replication. For our use case we
> want to try and guarantee that our consumers will see all messages even if
> they have fallen greatly behind the broker/producer. For this reason I
> wanted to know how the broker would react when the filesystem it writes its
> messages to is full. What I found was that the broker crashes and cannot be
> started until the filesystem has space again.
> Is there or would it make sense to provide configuration allowing the
> broker to reject writes in this case rather than crashing, electing a new
> leader and attempting the write again? I can clearly understand the use
> case that we don't want to 'lose' messages from the producer and I could
> also see how lack of filesystem space could be considered a machine
> failure, but with replication I would think if you are running out of space
> on 1 broker you are likely running out of space on others.
Jason Rosenberg 2013-08-15, 17:20
Jay Kreps 2013-08-15, 17:58
Jason Rosenberg 2013-08-15, 18:12
Jay Kreps 2013-08-15, 18:31
Jason Rosenberg 2013-08-16, 20:47