Re: Patterns for message failure handling with Kafka
Jason Rosenberg 2014-01-21, 22:55
So, I think there are 2 different types of errors you mention. The
first is data-dependent (e.g. the message is corrupt). Such a message
won't fix itself no matter how many times you retry, so there's no
reason to let it block consumption of other messages that are likely
to succeed. For that, I think it makes sense to stash it away to be
retried later (or just log it as invalid and carry on).
The second is transient failures (e.g. a downstream service is not
available). For those, I think it's fine to just keep retrying (with
exponential backoff) until the message succeeds before processing the
next one, which would likely fail for the same reason anyway.
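
To make the two cases concrete, here is a rough Python sketch. It is not
tied to any Kafka client: `TransientError`, `PermanentError`, `consume`,
the `stash` list, and the delay values are all hypothetical names and
numbers chosen for illustration, and `handler` stands in for whatever
per-message processing you do.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure that may clear up on its own (e.g. service down)."""


class PermanentError(Exception):
    """Hypothetical: a data-dependent failure that no retry will fix."""


def consume(messages, handler, stash, sleep=time.sleep,
            base_delay=0.1, max_delay=30.0):
    """Process messages in order and return the backoff delays that were used."""
    delays = []
    for message in messages:
        delay = base_delay
        while True:
            try:
                handler(message)
                break
            except PermanentError:
                # Bad data: set it aside and carry on with the next message.
                stash.append(message)
                break
            except TransientError:
                # Block on this message; later ones would likely fail too.
                sleep(delay)
                delays.append(delay)
                delay = min(delay * 2, max_delay)  # exponential backoff, capped
    return delays
```

The `sleep` parameter is only there so the waiting can be swapped out in
tests; in a real consumer you would also want some way to interrupt the
retry loop on shutdown.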
On Tue, Jan 21, 2014 at 5:46 PM, Jim <[EMAIL PROTECTED]> wrote:
> I'm looking at message delivery patterns for Kafka consumers and wanted to
> get people's thoughts on the following problem:
> The objective is to ensure processing of individual messages with as much
> certainty as possible, i.e. an "at least once" guarantee. I'm looking to have a
> Kafka consumer pull n messages (assume 100 for argument's sake), process
> them, commit the offset, then grab 100 more.
> The issue comes in where you have a single message failure. For example,
> message 30 cannot be deserialized, message 40 failed because some 3rd
> party service was down for an instant, etc. So we're looking at
> having a topic and a topic_retry pattern for consumers, so that if there was
> a single message failure we'd put messages 30 and 40 in the retry topic
> with a failure count of 1, and if that failure count passes 3 the message
> goes to cold storage for manual analysis. Once we have processed all 100,
> either by success or by making sure they were re-enqueued, we commit the
> offset, then grab more messages. If the percentage of messages sent to the
> retry topic goes over a threshold, we trip a circuit breaker for the
> consumer to stop pulling messages until the issue can be resolved, to
> prevent retry flooding.
> What are some patterns around this that people are using currently to
> handle message failures at scale with Kafka?
> Pardon if this is a frequent question, but the
> http://search-hadoop.com/kafka server is down, so I can't search the
> archives at the moment.
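
For reference, the topic / topic_retry flow described in the quoted
message could be sketched roughly as below. Plain Python lists stand in
for the retry topic and cold storage, and `process_batch`,
`should_trip`, and `MAX_FAILURES` are hypothetical names; the thread
gives no value for the circuit-breaker threshold, so the caller has to
supply one.

```python
MAX_FAILURES = 3  # from the quoted message: past 3 failures, go to cold storage


def process_batch(batch, handler, retry_topic, cold_storage):
    """Handle one batch of (payload, failure_count) pairs.

    Every message ends up processed, re-enqueued, or in cold storage, so the
    caller can commit the offset afterwards. Returns the fraction of the
    batch that failed.
    """
    failed = 0
    for payload, failures in batch:
        try:
            handler(payload)
        except Exception:
            failed += 1
            failures += 1
            if failures > MAX_FAILURES:
                cold_storage.append((payload, failures))  # manual analysis
            else:
                retry_topic.append((payload, failures))   # try again later
    return failed / len(batch) if batch else 0.0


def should_trip(failure_ratio, threshold):
    """Circuit breaker check: stop pulling when too much of a batch failed."""
    return failure_ratio > threshold
```

A consumer loop would call `process_batch`, commit the offset, and then
stop pulling if `should_trip` returns true for its chosen threshold.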