Deduplication is only available in 0.8.1 (current trunk), which has not been released
yet. We plan to release it right after 0.8-final is out. Here are some
wikis that describe the deduplication feature -
On Tue, Oct 1, 2013 at 11:01 AM, Sybrandy, Casey <
[EMAIL PROTECTED]> wrote:
> Interesting. I didn't know that Kafka had deduplication capabilities.
> How do you leverage it? Also, is it supported in Kafka 0.7.x?
> -----Original Message-----
> From: Guozhang Wang [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, October 01, 2013 11:33 AM
> To: [EMAIL PROTECTED]
> Subject: Re: use case with high rate of duplicate messages
> Batch processing will increase throughput but also increase latency.
> How much latency can your real-time processing tolerate?
> One thing you could try is to use keyed messages, with the key being the md5
> hash of your message. Kafka has a deduplication mechanism on the brokers
> that dedups messages with the same key. All you need to do is set the
> dedup frequency appropriately for your use case.
> On Tue, Oct 1, 2013 at 8:19 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
> > I have a use case where thousands of servers send status type
> > messages, which I am currently handling real-time w/o any kind of
> > queueing system.
> > So currently, when I receive a message, I compute an md5 hash of the
> > message and perform a lookup in my database to see if it is a
> > duplicate; if not, I store the message.
> > Now the message format can be either xml or json, and the actual
> > parsing of the message takes time, so I am thinking of storing
> > all the messages in kafka first and then batch processing these
> > messages in the hope that this will be faster.
> > Do you think there would be a faster way of recognizing duplicate
> > messages this way or it's just the same problem but doing it on a batch
> -- Guozhang
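
For illustration, the keying scheme suggested in the thread (the md5 hash of the message used as its key, with messages sharing a key collapsed into one) can be sketched in plain Python. This is only an in-memory approximation and not Kafka's broker-side mechanism; the function names message_key and dedup_by_key are hypothetical and do not come from any Kafka client.

```python
import hashlib


def message_key(payload: bytes) -> str:
    """Return the md5 hex digest of the raw message, used as its key."""
    return hashlib.md5(payload).hexdigest()


def dedup_by_key(payloads):
    """Keep one message per key (hypothetical helper, in-memory only).

    Later messages with the same key overwrite earlier ones, so each key
    ends up with its most recent payload. Because the key is a hash of the
    payload itself, identical messages collapse into a single entry.
    """
    latest = {}
    for payload in payloads:
        latest[message_key(payload)] = payload
    # dicts preserve insertion order, so results follow first appearance
    return list(latest.values())


if __name__ == "__main__":
    incoming = [b'{"host": "a", "status": "ok"}',
                b"<status host='b'>ok</status>",
                b'{"host": "a", "status": "ok"}']  # duplicate of the first
    for msg in dedup_by_key(incoming):
        print(message_key(msg), msg)
```

Hashing the raw bytes means the xml or json payload does not have to be parsed before duplicates are detected, which may help with the parsing cost mentioned in the original question.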