Just out of curiosity, how does Kafka know when to remove/delete messages
from disk? Is this just done whenever a message "falls off the end of the
(circular) buffer", or is there more to it than that? Also, when you say
that Kafka doesn't centrally maintain state (at all), does that mean
clients maintain their view of where (in the server-held buffer) they're
currently at - a kind of client-side cursor into the data? How does this
translate into no random I/O - you can't have mapped the entire
multi-terabyte store into memory using mmap, so does this simply
mean that when that particular client is consuming data, you're relying on
the OS to page in the relevant bits of the data store and relying on
sendfile (under the covers) to flush that to the socket? Have I understood
this correctly? Sorry, BTW, if these are RTFM questions - I saw some bits
in the docs, but I must admit I've not trawled the code for answers as yet.
Kafka keeps messages for a configurable rolling window of time per topic.
The default is 7 days, after which the messages are removed from disk by
the broker.
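For reference, the retention window is set in the broker's server.properties. A minimal sketch (property names as in the 0.8-era broker configs; check your version's docs):

```properties
# Delete log segments older than 7 days (168 hours)
log.retention.hours=168
# Retention is applied per log segment, so segment size affects
# how promptly old data actually disappears
log.segment.bytes=1073741824
```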
Correct - the consumers maintain their own state via what are known as
offsets. It's also true that when producers/consumers contact the broker
there is a random seek to the start of the offset, but the majority of
access patterns are linear.
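As a toy illustration of that client-side cursor (this is not the Kafka client API, just the idea): the consumer, not the broker, remembers a position in an append-only log, and each read is one seek to that position followed by purely linear I/O.

```python
class LogConsumer:
    """Conceptual sketch: a consumer holding its own offset into a log."""

    def __init__(self, path, offset=0):
        self.path = path
        self.offset = offset  # client-side cursor: byte position in the log

    def poll(self, max_bytes=4096):
        with open(self.path, "rb") as log:
            log.seek(self.offset)        # one seek to the stored offset...
            chunk = log.read(max_bytes)  # ...then sequential reads
        self.offset += len(chunk)        # the client advances its own cursor
        return chunk

# The broker never tracks who has read what; a restarted consumer simply
# resumes from whatever offset it last persisted.
```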
Fascinating. What are those guarantees going to be? One of the reasons
Rabbit runs a bit slower - one of several - when persisting data, is that
each write is fsync'ed to disk, whereas Kafka relies on OS-level flushing,
IIRC, providing a configurable parameter to force a flush after some
defined number of messages, so as to avoid too much potential data loss in
case of server failure. So in that respect, Rabbit has a higher guarantee
of durability in its current incarnation, with the obvious caveat that
doing so has an adverse effect on performance.
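The batched-flush idea can be sketched like this (a toy illustration, not Kafka's actual code; the flush_interval knob plays the role of the broker's log.flush.interval.messages setting):

```python
import os

class BatchedLog:
    """Sketch of Kafka-style flushing: fsync after every N messages rather
    than after every single write (RabbitMQ-style), trading a bounded
    window of potential data loss for much better throughput."""

    def __init__(self, path, flush_interval=100):
        self.f = open(path, "ab")
        self.flush_interval = flush_interval
        self.unflushed = 0  # messages written but not yet forced to disk

    def append(self, msg: bytes):
        self.f.write(msg + b"\n")
        self.unflushed += 1
        if self.unflushed >= self.flush_interval:
            self.f.flush()
            os.fsync(self.f.fileno())  # force the OS page cache to disk
            self.unflushed = 0

    def close(self):
        self.f.flush()
        os.fsync(self.f.fileno())
        self.f.close()
```

A crash loses at most flush_interval messages; setting flush_interval=1 gives you the per-write fsync behaviour described for Rabbit.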
When you say "message guarantees", are we talking about ordering, or
delivery, or both? Very interested to hear about those.
Correct - with 0.8, Kafka will have an fsync configuration option similar
to Rabbit's. Messages have always had ordering guarantees, but with 0.8
there is the notion of topic replicas, similar to the replication factor
in Hadoop or Cassandra:
http://www.slideshare.net/junrao/kafka-replication-apachecon2013
With configuration you can trade off latency for durability, with three options:
- Producer receives no acks (no network delay)
- Producer waits for ack from broker leader (1 network roundtrip)
- Producer waits for quorum ack (2 network roundtrips)
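In the 0.8 producer configs these three options map, roughly, onto a single property (request.required.acks; -1 means wait for the quorum - verify the exact names against your version's docs):

```properties
# No acks: lowest latency, weakest durability
request.required.acks=0
# Ack from the partition leader: one network roundtrip
#request.required.acks=1
# Ack once the quorum of replicas has the message: two roundtrips
#request.required.acks=-1
```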
With the combination of quorum commits and consumers managing state, you
can get much closer to exactly-once guarantees, i.e. the consumers can
manage their consumption state as well as the consumed messages in the
same data store.
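A sketch of that "consumed message and offset in the same store" idea (an illustration using sqlite, not anything Kafka ships): the message and the advanced offset commit in one transaction, so a crash can never store the message without the offset, and a redelivered message is detected and dropped.

```python
import sqlite3

def init_store(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages (offset INTEGER PRIMARY KEY, body TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cursor (id INTEGER PRIMARY KEY CHECK (id = 0),"
        " next_offset INTEGER)")
    conn.execute("INSERT OR IGNORE INTO cursor VALUES (0, 0)")
    conn.commit()

def consume(conn, offset, body):
    expected = conn.execute("SELECT next_offset FROM cursor").fetchone()[0]
    if offset < expected:
        return False  # duplicate delivery: already processed, safely ignored
    with conn:  # atomic: message insert and cursor advance commit together
        conn.execute("INSERT INTO messages VALUES (?, ?)", (offset, body))
        conn.execute("UPDATE cursor SET next_offset = ?", (offset + 1,))
    return True
```

Because deduplication and storage are governed by one transaction, at-least-once delivery from the broker becomes effectively exactly-once processing on the consumer side.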
On Mon, Jun 10, 2013 at 6:40 AM, Tim Watson <[EMAIL PROTECTED]> wrote: