Kafka, mail # user - use case with high rate of duplicate messages


Re: use case with high rate of duplicate messages
Neha Narkhede 2013-10-01, 18:37
It is only available in 0.8.1 (current trunk) which has not been released
yet. We plan to release it right after 0.8-final is out. Here are some
wikis that describe the deduplication feature -

https://cwiki.apache.org/confluence/display/KAFKA/Keyed+Messages+Proposal
https://cwiki.apache.org/confluence/display/KAFKA/Log+Compaction

Thanks,
Neha
On Tue, Oct 1, 2013 at 11:01 AM, Sybrandy, Casey <
[EMAIL PROTECTED]> wrote:

> Interesting.  I didn't know that Kafka had deduplication capabilities.
>  How do you leverage it?  Also, is it supported in Kafka 0.7.x?
>
> -----Original Message-----
> From: Guozhang Wang [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, October 01, 2013 11:33 AM
> To: [EMAIL PROTECTED]
> Subject: Re: use case with high rate of duplicate messages
>
> Batch processing will increase the throughput but also increase latency.
> How much latency can your real-time processing tolerate?
>
> One thing you could try is to use keyed messages, with the key being the
> md5 hash of your message. Kafka has a deduplication mechanism on the
> brokers that dedups messages with the same key. All you need to do is set
> the dedup frequency appropriately for your use case.
>
> Guozhang
>
>
> On Tue, Oct 1, 2013 at 8:19 AM, S Ahmed <[EMAIL PROTECTED]> wrote:
>
> > I have a use case where thousands of servers send status type
> > messages, which I am currently handling real-time w/o any kind of
> > queueing system.
> >
> > So currently, when I receive a message, I compute an md5 hash of the
> > message and perform a lookup in my database to see if it is a
> > duplicate; if not, I store the message.
> >
> > Now the message format can be either xml or json, and the actual
> > parsing of the message takes time, so I am thinking of storing
> > all the messages in kafka first and then batch processing these
> > messages in hopes that this will be faster to do.
> >
> > Do you think there would be a faster way of recognizing duplicate
> > messages this way, or is it just the same problem but done at the
> > batch level?
> >
>
>
>
> --
> -- Guozhang
>
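
As a rough illustration of the keyed-message idea discussed in this thread, here is a minimal Python sketch. It uses only the standard library and does not call any Kafka API: `message_key` and `compact` are hypothetical helper names, and `compact` merely imitates, in memory, the effect described on the Log Compaction wiki (keeping the latest message for each key). It assumes duplicates are byte-for-byte identical, since an md5 of the raw payload will not match if the same status is serialized differently.

```python
import hashlib


def message_key(payload: bytes) -> str:
    """Hypothetical helper: derive a message key from the md5 of the raw payload."""
    return hashlib.md5(payload).hexdigest()


def compact(log):
    """In-memory imitation of key-based compaction (not Kafka code).

    `log` is an iterable of (key, value) pairs in offset order. The result
    keeps only the most recent value for each key, ordered by the offset
    of that most recent occurrence.
    """
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # a later offset overwrites an earlier one
    survivors = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in survivors]


if __name__ == "__main__":
    # Made-up status messages; the first and third are exact duplicates.
    payloads = [
        b'{"host": "a", "status": "ok"}',
        b'<status host="b">ok</status>',
        b'{"host": "a", "status": "ok"}',
    ]
    log = [(message_key(p), p) for p in payloads]
    for key, value in compact(log):
        print(key, value)
```

Because the key is computed from the unparsed bytes, the xml/json parsing cost is only paid for messages that survive deduplication. Compaction on a broker happens in the background, so a consumer reading a not-yet-compacted part of the log may still see duplicates and should tolerate them.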