This is just my opinion of course (who else's could it be? :-)) but I think
from an engineering point of view, one must spend one's time making the
Producer-Kafka connection solid, if it is mission-critical.
Kafka is all about getting messages to disk, and assuming your disks are
solid (and 0.8 has replication) those messages are safe. To then try to
build a system to cope with the Kafka brokers being unavailable seems like
you're setting yourself up for infinite regress. And to write code in the
Producer to spool to disk seems even more pointless. If you're that
worried, why not run a dedicated Kafka broker on the same node as the
Producer, and connect over localhost? To turn around and write code to
spool to disk, because the primary system that *spools to disk* is down
seems to be missing the point.
That said, even by going over localhost, I guess the network connection
could go down. In that case, Producers should buffer in RAM, and start
sending some major alerts to the Operations team. But this should almost
*never happen*. If it is happening regularly *something is fundamentally
wrong with your system design*. Those Producers should also refuse any more
incoming traffic and await intervention. Even bringing up "netcat -l" and
letting it suck in the data and write it to disk would work then.
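
To make the "buffer in RAM, alert, then refuse traffic" behaviour concrete, here is a rough sketch. It is not our producer and not the Kafka client API: EmergencyBuffer, BufferFull, the alert callback and the transport callable are all hypothetical names, and transport is assumed to raise ConnectionError when the broker cannot be reached.

```python
from collections import deque


class BufferFull(Exception):
    """Raised when the RAM buffer is full and new traffic must be refused."""


class EmergencyBuffer:
    """Hypothetical bounded RAM buffer, used only while the broker is down."""

    def __init__(self, capacity, alert):
        self.capacity = capacity   # maximum number of messages held in RAM
        self.alert = alert         # callable(str); stands in for paging Operations
        self.pending = deque()     # messages waiting for the broker to return
        self.alerted = False       # so Operations is alerted once per outage

    def flush(self, transport):
        """Send buffered messages in order; stop at the first failure."""
        while self.pending:
            transport(self.pending[0])   # raises ConnectionError if still down
            self.pending.popleft()       # drop only after a successful send

    def send(self, message, transport):
        """Deliver via transport(message), buffering in RAM on failure."""
        try:
            self.flush(transport)        # older messages go first
            transport(message)
            self.alerted = False         # outage is over
        except ConnectionError:
            if not self.alerted:
                self.alert("broker unreachable, buffering in RAM")
                self.alerted = True
            if len(self.pending) >= self.capacity:
                # Refuse further traffic and wait for intervention.
                raise BufferFull("buffer full, refusing traffic")
            self.pending.append(message)
```

The buffer is deliberately bounded: once it fills up, the caller gets an exception instead of the process quietly eating all its memory, which matches the "refuse traffic and await intervention" stance above.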
Alternatives include having Producers connect to a load-balancer with
multiple Kafka brokers behind it, which helps you deal with any one Kafka
broker failing. Or just have your Producers connect directly to multiple
Kafka brokers, and switch over as needed if any one broker goes down.
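
The "switch over as needed" idea can be sketched roughly as below. This only opens a plain TCP connection to the first broker that answers; it does not speak the Kafka protocol, and connect_with_failover and the host names in the usage note are made up for illustration.

```python
import socket


def connect_with_failover(brokers, connect=socket.create_connection, timeout=2.0):
    """Return (address, connection) for the first broker that accepts.

    brokers is a list of (host, port) tuples, tried in the order given.
    connect is injectable so the logic can be exercised without a network.
    """
    errors = []
    for address in brokers:
        try:
            return address, connect(address, timeout)
        except OSError as exc:   # refused, unreachable, timed out, ...
            errors.append((address, exc))
    # Every broker failed: surface that loudly rather than retrying forever.
    raise ConnectionError(f"no broker reachable: {errors}")
```

Usage would look something like connect_with_failover([("kafka1.example", 9092), ("kafka2.example", 9092)]), calling it again whenever the current connection drops.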
I don't know if the standard Kafka producer that ships with Kafka supports
buffering in RAM in an emergency. We wrote our own that does, with a focus
on speed and simplicity, but I expect it will very rarely, if ever, buffer.
Building and using semi-reliable system after semi-reliable system, and
chaining them all together, hoping to be more tolerant of failure is not
necessarily a good approach. Instead, identifying that one system that is
critical, and ensuring that it remains up (redundant installations,
redundant disks, redundant network connections etc) is a better approach.
On Fri, Apr 12, 2013 at 7:54 AM, Jun Rao <[EMAIL PROTECTED]> wrote: