I have a use case where thousands of servers send status-type messages,
which I am currently handling in real time without any kind of queueing system.

Currently, when I receive a message, I compute an MD5 hash of the
message and look it up in my database to see if it is a duplicate. If
not, I store the message.
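
To make the current flow concrete, here is a minimal sketch of the hash-and-lookup step. The in-memory SQLite table, its column names, and the function names are placeholders for illustration only, not my actual database or code.

```python
import hashlib
import sqlite3


def make_store():
    # In-memory SQLite stands in for the real database (an assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE messages (digest TEXT PRIMARY KEY, body BLOB)")
    return conn


def store_if_new(conn, raw: bytes) -> bool:
    """Hash the raw message, look the digest up, and store only if unseen.

    Returns True if the message was stored, False if it was a duplicate.
    """
    digest = hashlib.md5(raw).hexdigest()
    row = conn.execute(
        "SELECT 1 FROM messages WHERE digest = ?", (digest,)
    ).fetchone()
    if row is not None:
        return False  # duplicate: digest already present
    conn.execute(
        "INSERT INTO messages (digest, body) VALUES (?, ?)", (digest, raw)
    )
    conn.commit()
    return True
```

Note that the hash is taken over the raw bytes, so the duplicate check itself does not need the XML or JSON to be parsed.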

The message format can be either XML or JSON, and the actual parsing of
the message takes time, so I am thinking of storing all the messages
in Kafka first and then batch processing them, in the hope that this
will be faster.
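
What I have in mind for the batch step is roughly the sketch below. It assumes a batch of raw messages has already been read from the queue (the consumer code is left out), and it uses a plain set of known digests as a stand-in for the database lookup. The function and parameter names are made up for illustration.

```python
import hashlib


def dedupe_batch(raw_messages, known_digests):
    """Return (digest, raw) pairs from the batch that have not been seen.

    raw_messages: iterable of raw message bytes (unparsed XML or JSON).
    known_digests: set of hex digests already stored; a hypothetical
        stand-in for one bulk lookup against the database.
    """
    fresh = []
    seen_in_batch = set()
    for raw in raw_messages:
        digest = hashlib.md5(raw).hexdigest()
        # Skip anything already stored or already seen earlier in this batch.
        if digest in known_digests or digest in seen_in_batch:
            continue
        seen_in_batch.add(digest)
        fresh.append((digest, raw))
    return fresh
```

Only the messages returned here would then need to be parsed and stored. Whether this ends up faster than the per-message lookup is exactly what I am unsure about.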

Do you think there would be a faster way of recognizing duplicate messages
this way, or is it just the same problem moved to the batch level?

 