Re: Analysis of producer performance -- and Producer-Kafka reliability
On Fri, Apr 12, 2013 at 8:27 AM, S Ahmed <[EMAIL PROTECTED]> wrote:

> Interesting topic.
> How would buffering in RAM help in reality though (just trying to work
> through the scenario in my head):
> producer tries to connect to a broker, it fails, so it appends the message
> to an in-memory store.  If the broker is down for say 20 minutes and then
> comes back online, won't this create problems now when the producer creates
> a new message, and it has 20 minutes of backlog, and the broker now is
> handling more load (assuming you are sending those in-memory messages using
> a different thread)?

This is why you over-provision the capacity of your Producers and Kafka
cluster. Engineer it that way, taking these requirements into account. If
done correctly, your Producer system should stream that backlog to your
Kafka cluster in much, much less time than 20 minutes. So your system now
has the characteristic that it rarely fails, but if it does, latency may
increase for a little while. But no data is lost and, just as importantly,
the system is *easy to understand and maintain*. Try doing that where the
Producers talk to Kafka, but talk to this other system if Kafka is down,
and do something else if the network is down, and do something else if the
disk is down, and so on.

If your system is incapable of catching up, it hasn't been correctly
designed. Yes, it'll cost more to overprovision, but that's what
engineering is about -- making trade-offs. Only really high-end systems
need more fault-tolerance IMHO. And they are expensive.
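
To put rough numbers on the catch-up claim, here is a minimal sketch. It
assumes a constant incoming message rate and a fixed drain capacity, both of
which are simplifications, and the function name and figures are made up for
illustration. New messages keep arriving while the backlog drains, so only
the spare capacity (drain capacity minus incoming rate) works on the backlog.

```python
def catch_up_minutes(outage_minutes, incoming_rate, drain_capacity):
    """Minutes needed to clear a backlog while new messages keep arriving.

    incoming_rate and drain_capacity are in messages per minute
    (illustrative units). Returns float("inf") when there is no spare
    capacity, i.e. the system can never catch up.
    """
    backlog = outage_minutes * incoming_rate
    spare = drain_capacity - incoming_rate
    if spare <= 0:
        return float("inf")
    return backlog / spare


if __name__ == "__main__":
    # A 20-minute outage at 1000 msgs/min, for several provisioning levels.
    for headroom in (1.0, 1.5, 2.0, 5.0, 11.0):
        minutes = catch_up_minutes(20, 1000, 1000 * headroom)
        print(f"{headroom:>4}x capacity: {minutes:.1f} min to drain")
```

Under these assumptions the drain time is the outage length divided by
(headroom - 1), so draining in much less than 20 minutes needs capacity well
above twice the normal load.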

Just to be clear, I am not saying buffering in RAM is desirable. It isn't.
But it should almost never happen. If it is happening even more than
rarely, something is fundamentally wrong with the system.
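
For illustration only, the emergency behaviour described in my earlier
message quoted below (buffer in RAM, alert Operations, refuse further traffic)
could be sketched roughly like this. This is not our actual producer and not
the Kafka client API: `BufferingProducer`, `send_fn`, `alert_fn` and
`max_buffered` are hypothetical names, and the transport is assumed to raise
`ConnectionError` when the broker is unreachable.

```python
from collections import deque


class BufferingProducer:
    """Hypothetical wrapper around a caller-supplied send function."""

    def __init__(self, send_fn, alert_fn, max_buffered=10000):
        self._send = send_fn      # callable(message); raises ConnectionError on failure
        self._alert = alert_fn    # callable(text); e.g. pages the Operations team
        self._max = max_buffered  # hard cap on messages held in RAM
        self._buffer = deque()

    def publish(self, message):
        """Return True if the message was sent or buffered, False if refused."""
        if len(self._buffer) >= self._max:
            self.flush()  # the broker may be back; try to make room first
        if len(self._buffer) >= self._max:
            self._alert("buffer full, refusing incoming traffic")
            return False
        self._buffer.append(message)
        self.flush()
        return True

    def flush(self):
        """Send buffered messages oldest first; stop at the first failure."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                self._alert(f"broker unreachable, {len(self._buffer)} buffered")
                return False
            self._buffer.popleft()  # drop only after a successful send
        return True

    def backlog(self):
        return len(self._buffer)
```

The cap is the important design choice here: an unbounded buffer just moves
the failure from the broker to the Producer's heap, whereas a bounded one
turns it into an explicit refusal that upstream systems and operators can see.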

> On Fri, Apr 12, 2013 at 11:21 AM, Philip O'Toole <[EMAIL PROTECTED]>
> wrote:
> > This is just my opinion of course (who else's could it be? :-)) but I think
> > from an engineering point of view, one must spend one's time making the
> > Producer-Kafka connection solid, if it is mission-critical.
> >
> > Kafka is all about getting messages to disk, and assuming your disks are
> > solid (and 0.8 has replication) those messages are safe. To then try to
> > build a system to cope with the Kafka brokers being unavailable seems like
> > you're setting yourself up for infinite regress. And to write code in the
> > Producer to spool to disk seems even more pointless. If you're that
> > worried, why not run a dedicated Kafka broker on the same node as the
> > Producer, and connect over localhost? To turn around and write code to
> > spool to disk, because the primary system that *spools to disk* is down
> > seems to be missing the point.
> >
> > That said, even by going over localhost, I guess the network connection
> > could go down. In that case, Producers should buffer in RAM, and start
> > sending some major alerts to the Operations team. But this should almost
> > *never happen*. If it is happening regularly *something is fundamentally
> > wrong with your system design*. Those Producers should also refuse any more
> > incoming traffic and await intervention. Even bringing up "netcat -l" and
> > letting it suck in the data and write it to disk would work then.
> > Alternatives include having Producers connect to a load-balancer with
> > multiple Kafka brokers behind it, which helps you deal with any one Kafka
> > broker failing. Or just have your Producers connect directly to multiple
> > Kafka brokers, and switch over as needed if any one broker goes down.
> >
> > I don't know if the standard Kafka producer that ships with Kafka supports
> > buffering in RAM in an emergency. We wrote our own that does, with a focus
> > on speed and simplicity, but I expect it will very rarely, if ever, buffer
> > in RAM.
> >
> > Building and using semi-reliable system after semi-reliable system, and
> > chaining them all together, hoping to be more tolerant of failure is not
> > necessarily a good approach. Instead, identifying that one system that is