Re: Duplicate records in Kafka 0.7
It depends on how you process a batch of compressed messages. In 0.7, the
message offset only advances at the compressed message set boundary. So, if
you always finish processing all messages in a compressed set, there
shouldn't be any duplicates. If, say, you stop after consuming only 3
messages in a compressed set of 10, when you refetch, you will get the
first 3 messages again.

Thanks,

Jun
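
[Editor's note: to make the boundary behavior above concrete, here is a minimal consumer-loop sketch. This is not the actual Kafka 0.7 client API; fetchMessageSet, loadCheckpoint and saveCheckpoint are hypothetical placeholders. The point is only that the offset is persisted after every message in a fetched compressed set has been handled, since a partially processed set is re-fetched in full.]

import java.util.ArrayList;
import java.util.List;

public class CompressedSetConsumer {

    interface MessageHandler {
        void handle(byte[] payload) throws Exception;
    }

    // Hypothetical shape of one fetched compressed message set plus the
    // offset that points just past it (0.7 offsets advance at set boundaries).
    static class FetchedSet {
        List<byte[]> messages = new ArrayList<byte[]>();
        long nextOffset;
    }

    void consumeLoop(MessageHandler handler) throws Exception {
        long offset = loadCheckpoint();
        while (true) {
            FetchedSet set = fetchMessageSet(offset);    // hypothetical helper
            for (byte[] msg : set.messages) {
                handler.handle(msg);                     // process every message in the set...
            }
            offset = set.nextOffset;
            saveCheckpoint(offset);                      // ...before persisting the new offset
        }
    }

    // Hypothetical stubs; a real consumer would keep the offset in ZooKeeper
    // or a local file and fetch via the Kafka 0.7 consumer.
    long loadCheckpoint() { return 0L; }
    void saveCheckpoint(long offset) { }
    FetchedSet fetchMessageSet(long offset) { return new FetchedSet(); }
}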
On Fri, Jan 10, 2014 at 11:17 PM, Xuyen On <[EMAIL PROTECTED]> wrote:

> Actually, most of the duplicates I was seeing were due to a bug in the old
> Hive version I'm using (0.9).
> But I am still seeing some duplicates, although fewer. Instead of 3-13%,
> I'm now seeing less than 1%. This appears to be the case for each batch my
> consumer fetches, which is currently set to 1,000,000 messages. Does that
> seem more reasonable?
>
> -----Original Message-----
> From: Joel Koshy [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, January 09, 2014 7:07 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Duplicate records in Kafka 0.7
>
> You mean duplicate records on the consumer side? Duplicates are possible
> if there are consumer failures and another consumer instance resumes from
> an earlier offset. They are also possible if there are producer retries due
> to exceptions while producing. Do you see any of these errors in your logs?
> Besides these scenarios, though, you shouldn't be seeing duplicates.
>
> Thanks,
>
> Joel
>
>
> On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <[EMAIL PROTECTED]> wrote:
> > Hi,
> >
> > I would like to check whether other people are seeing duplicate records
> > with Kafka 0.7. I read the JIRAs and I believe that duplicates are still
> > possible when using message compression on Kafka 0.7. I'm seeing
> > duplicate records in the range of 6-13%. Is this normal?
> >
> > If you're using Kafka 0.7 with message compression enabled, can you
> > please let me know whether you see any duplicate records and, if so, what %?
> >
> > Also, please let me know what sort of deduplication strategy you're
> > using.
> >
> > Thanks!
> >
> >
>
>
>
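
[Editor's note: on the deduplication question raised above, one generic consumer-side strategy (an illustration only, not something prescribed in this thread) is to keep a bounded LRU set of hashes of recently seen payloads and drop repeats. This only catches duplicates that arrive close together, such as a re-consumed batch after a restart, and it cannot distinguish legitimately identical records.]

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class RecentDuplicateFilter {

    private final Set<String> seen;

    public RecentDuplicateFilter(final int maxEntries) {
        // An access-ordered LinkedHashMap with removeEldestEntry gives a
        // simple bounded LRU set of recently seen payload hashes.
        this.seen = Collections.newSetFromMap(
            new LinkedHashMap<String, Boolean>(maxEntries, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Boolean> eldest) {
                    return size() > maxEntries;
                }
            });
    }

    /** Returns true if the payload has not been seen recently (i.e. keep it). */
    public boolean firstSighting(byte[] payload) {
        return seen.add(hash(payload));
    }

    private static String hash(byte[] payload) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(payload);
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RecentDuplicateFilter filter = new RecentDuplicateFilter(1_000_000);
        byte[] msg = "example record".getBytes(StandardCharsets.UTF_8);
        System.out.println(filter.firstSighting(msg)); // true: first time seen
        System.out.println(filter.firstSighting(msg)); // false: duplicate dropped
    }
}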
