Re: Unpacking "strong durability and fault-tolerance guarantees"
Philip O'Toole 2013-07-07, 19:51
Disclaimer: I only have experience right now with 0.7.2
On Sun, Jul 7, 2013 at 11:35 AM, David James <[EMAIL PROTECTED]> wrote:
> Sorry for the long email, but I've tried to keep it organized, at least.
> "Kafka has a modern cluster-centric design that offers strong
> durability and fault-tolerance guarantees." and "Messages are
> persisted on disk and replicated within the cluster to prevent data
> loss." according to http://kafka.apache.org/.
> I'm trying to understand what this means in some detail. So, two questions.
> 1. Fault-Tolerance
> If a Broker in a Kafka cluster fails (the EC2 instance dies), what
> happens? After, let's say, I add a new Broker to the cluster (that's my
> responsibility, not Kafka's). What happens when it rejoins?
> To be more particular, if the cluster consists of a Zookeeper and B
> (3, for example) Brokers, can a Kafka system guarantee to tolerate up
> to B-1 (2, for example) Broker failures?
> 2. Durability at an application level
> What are the guarantees about durability, at an application level, in
> practice? By "application level" I mean guarantees that a produced
> message gets consumed and acted upon by an application that uses
> Kafka. My understanding at present is that Kafka does not make these
> kinds of guarantees because there are no acks. So, it is up to the
> application developer to handle it. Is this right?
> Here's my understanding: Having messages persisted on disk and
> replicated is why Kafka has durability guarantees. But, from an
> application perspective, what happens when a consumer pulls a message
> but fails before acting on it? That would update the Kafka consumer
> offset, right? So, without some thinking and planning ahead on the
> Kafka system design, the application's consumers would not have a way
> of knowing that a message was not actually processed.
Don't update the offset until the message has been fully processed. This
means you need to build downstream systems that can accept messages
being replayed, since a message may be processed but the consumer may
crash before the offset is updated. Or at least have a process in
place to deal with clean-up, in the event of a crash.
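To make that concrete, here is a minimal sketch (not real Kafka client code; the Consumer class, message shape, and crash hook are all hypothetical) of the pattern: commit the offset only after processing, and make the downstream side idempotent so a replayed message after a crash is harmless.

```python
# Hypothetical sketch of "commit the offset only after processing" with a
# replay-tolerant (idempotent) downstream. None of these names come from
# the Kafka API; they exist only to illustrate the pattern.

class Consumer:
    def __init__(self, log):
        self.log = log            # the partition's message log
        self.committed = 0        # last committed offset (next one to read)
        self.seen_ids = set()     # downstream dedup state

    def process(self, msg):
        # Idempotent: replaying an already-seen message is a no-op.
        if msg["id"] in self.seen_ids:
            return
        self.seen_ids.add(msg["id"])

    def run(self, crash_before_commit_at=None):
        pos = self.committed      # always resume from the committed offset
        while pos < len(self.log):
            self.process(self.log[pos])
            if pos == crash_before_commit_at:
                return            # simulate a crash: offset NOT committed
            self.committed = pos + 1   # commit only AFTER processing
            pos += 1

log = [{"id": i} for i in range(3)]
c = Consumer(log)
c.run(crash_before_commit_at=1)   # processed msg 1, crashed before commit
c.run()                           # restart: msg 1 is replayed, dedup absorbs it
assert c.committed == 3 and len(c.seen_ids) == 3
```

If the commit were moved before `process()`, the crash in the example would silently lose message 1 on restart, which is exactly the failure mode described above.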
> Conclusion / Last Question
> I'm interested in making the chance of message loss minimal, at a
> system level. Any pointers on what to read or think about would be