use case with high rate of duplicate messages
I have a use case where thousands of servers send status-type messages,
which I currently handle in real time without any kind of queueing system.

Currently, when I receive a message, I compute an MD5 hash of the message
and look that hash up in my database to see whether the message is a
duplicate. If it is not, I store the message.
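
To make the current flow concrete, here is a minimal sketch in Python. The
names handle_message, seen_hashes and stored are made up for illustration,
and the in-memory set and list only stand in for the real database lookup
and storage, which are not shown here.

```python
import hashlib

# Assumption: an in-memory set stands in for the database table of known hashes.
seen_hashes = set()
# Assumption: a list stands in for wherever accepted messages are stored.
stored = []


def handle_message(raw: bytes) -> bool:
    """Store the message unless its MD5 digest was seen before.

    Returns True if the message was stored, False if it was a duplicate.
    """
    digest = hashlib.md5(raw).hexdigest()
    if digest in seen_hashes:       # the "lookup in my database" step
        return False
    seen_hashes.add(digest)
    stored.append(raw)              # the "store the message" step
    return True
```

Note that the hash is taken over the raw bytes, so two messages count as
duplicates only if they are byte-for-byte identical.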

The message format can be either XML or JSON, and the actual parsing of
a message takes time, so I am thinking of storing all the messages
in Kafka first and then batch processing them, in the hope that this
will be faster.
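
A rough sketch of what the batch step could look like, assuming the batch
has already been read from Kafka as a list of raw byte payloads (the
consumer code is left out on purpose). The function name dedupe_batch and
the known_hashes set are hypothetical; known_hashes stands in for the
hashes already in the database.

```python
import hashlib


def dedupe_batch(raw_messages, known_hashes):
    """Return the messages in this batch that have not been seen before.

    raw_messages: iterable of bytes payloads (unparsed XML or JSON).
    known_hashes: set of MD5 hex digests already stored (assumption:
                  stands in for the database); updated in place.
    """
    # Collapse duplicates inside the batch, keeping the first occurrence.
    first_seen = {}
    for raw in raw_messages:
        digest = hashlib.md5(raw).hexdigest()
        first_seen.setdefault(digest, raw)

    # One set difference replaces one lookup per message.
    new_digests = first_seen.keys() - known_hashes
    known_hashes.update(new_digests)

    # Only these survivors would need to be parsed and stored.
    return [raw for digest, raw in first_seen.items() if digest in new_digests]
```

The hashing cost per message is the same as before. What changes in this
sketch is that duplicates within a batch are dropped before any lookup,
the lookup happens once per batch, and parsing is only needed for the
messages that survive.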

Do you think this would give me a faster way of recognizing duplicate
messages, or is it just the same problem moved to the batch level?