On Fri, Apr 12, 2013 at 8:27 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
This is why you over-provision the capacity of your Producers and Kafka
cluster. Engineer it that way, taking these requirements into account. If
done correctly, your Producer system should stream that backlog to your
Kafka cluster in much, much less time than 20 minutes. So your system now
has the characteristic that it rarely fails, but if it does, latency may
increase for a little while. But no data is lost and, just as importantly,
the system is *easy to understand and maintain*. Try doing that where the
Producers talk to Kafka, but talk to this other system if Kafka is down,
and do something else if the network is down, and do something else if the
disk is down, etc.
If your system is incapable of catching up, it hasn't been correctly
designed. Yes, it'll cost more to overprovision, but that's what
engineering is about -- making trade-offs. Only really high-end systems
need more fault-tolerance IMHO. And they are expensive.
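
To put rough numbers on "catching up", here is a back-of-the-envelope
sketch. The function name and the rates are made up for illustration, and
it assumes constant incoming and drain rates, which real traffic won't
have. The point is only that catch-up time depends on spare capacity, not
total capacity, because new data keeps arriving while the backlog drains.

```python
def catch_up_minutes(outage_minutes, incoming_rate, drain_capacity):
    """Estimate minutes needed to clear the backlog built up during an outage.

    incoming_rate:  messages/sec the producers keep generating (assumed constant)
    drain_capacity: messages/sec the producer -> Kafka path can sustain
    """
    if drain_capacity <= incoming_rate:
        # No headroom: the backlog never shrinks.
        raise ValueError("system cannot catch up; it is under-provisioned")
    backlog = outage_minutes * 60 * incoming_rate
    # Only the capacity left over after handling live traffic eats the backlog.
    spare = drain_capacity - incoming_rate
    return backlog / spare / 60


# Hypothetical numbers: 20 minute outage, 10k msg/s incoming, 5x capacity.
print(catch_up_minutes(20, 10_000, 50_000))  # 5.0
```

With only 2x capacity the same backlog takes a full 20 minutes to clear,
and at 1x it never clears at all.
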
Just to be clear, I am not saying buffering in RAM is desirable. It isn't.
But it should almost never happen. If it is happening even more than
rarely, something is fundamentally wrong with the system.