It depends on how you process a batch of compressed messages. In 0.7, the
message offset only advances at the compressed message set boundary. So, if
you always finish processing all messages in a compressed set, there
shouldn't be any duplicates. If, say, you stop after consuming only 3
messages in a compressed set of 10, when you refetch, you will get the
first 3 messages again.
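
To make that concrete, here is a rough sketch (plain Java, not the actual
0.7 consumer API; CompressedSet and its fields are made-up stand-ins) of one
way a consumer can skip the messages it already handled when a partially
consumed compressed set is replayed on a refetch:

import java.util.List;

// Rough sketch, not the Kafka 0.7 consumer API. CompressedSet is a made-up
// stand-in for "one fetched compressed message set plus its boundary offset".
public class CompressedSetCursor {

    static class CompressedSet {
        final long setOffset;          // offset at the compressed-set boundary
        final List<byte[]> messages;   // decompressed messages in the set
        CompressedSet(long setOffset, List<byte[]> messages) {
            this.setOffset = setOffset;
            this.messages = messages;
        }
    }

    private long currentSetOffset = -1L; // set we were last working through
    private int processedInSet = 0;      // how many of its messages we finished

    public void handle(CompressedSet set) {
        // A refetch after a partial consume replays the same set from its
        // boundary offset, so skip the messages we already processed.
        int skip = (set.setOffset == currentSetOffset) ? processedInSet : 0;
        if (set.setOffset != currentSetOffset) {
            currentSetOffset = set.setOffset;
            processedInSet = 0;
        }
        for (int i = skip; i < set.messages.size(); i++) {
            process(set.messages.get(i));
            processedInSet++; // persist this together with the offset if the
                              // cursor has to survive a consumer restart
        }
    }

    private void process(byte[] payload) {
        // application-specific handling goes here
    }
}

Note the cursor above is in-memory only; as the comment says, you would have
to persist it alongside the consumed offset if you need it to survive a
consumer crash rather than just a refetch within the same process.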
On Fri, Jan 10, 2014 at 11:17 PM, Xuyen On <[EMAIL PROTECTED]> wrote:
> Actually, most of the duplicates I was seeing were due to a bug in the old
> Hive version I'm using, 0.9.
> But I am still seeing some duplicates, although fewer. Instead of 3-13%,
> I'm now seeing less than 1%. This appears to be the case for each batch my
> consumer processes, which is currently set to 1,000,000 messages. Does that
> seem more reasonable?
> -----Original Message-----
> From: Joel Koshy [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, January 09, 2014 7:07 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Duplicate records in Kafka 0.7
> You mean duplicate records on the consumer side? Duplicates are possible
> if there are consumer failures and another consumer instance resumes from
> an earlier offset. Duplicates are also possible if there are producer
> retries due to exceptions while producing. Do you see any of these errors
> in your logs? Besides these scenarios, though, you shouldn't be seeing
> duplicates.
> On Wed, Jan 8, 2014 at 5:21 PM, Xuyen On <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I would like to check whether other people are seeing duplicate records
> with Kafka 0.7. I read the JIRAs and I believe that duplicates are still
> possible when using message compression on Kafka 0.7. I'm seeing duplicate
> records in the range of 6-13%. Is this normal?
> > If you're using Kafka 0.7 with message compression enabled, can you
> please let me know whether you see any duplicate records and, if so, what
> percentage?
> > Also, please let me know what sort of deduplication strategy you're using.
> > Thanks!
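
As for the deduplication-strategy question at the end of the thread: offsets
alone cannot remove duplicates caused by producer retries, so a common
consumer-side approach is to drop messages whose payload (or, better, an
application-level event id carried in the message) was already seen recently.
A minimal sketch, assuming a bounded in-memory window is acceptable and using
a SHA-1 hash of the payload as the dedup key:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch of consumer-side deduplication; nothing here is built
// into Kafka 0.7. The window size and the payload-hash key are assumptions.
public class RecentMessageDeduplicator {

    private final Map<String, Boolean> seen;

    public RecentMessageDeduplicator(final int windowSize) {
        // A LinkedHashMap in access-order mode doubles as a tiny LRU cache.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > windowSize;
            }
        };
    }

    // Returns true if the payload was not seen recently, i.e. process it.
    public boolean firstSighting(byte[] payload) {
        return seen.put(sha1Hex(payload), Boolean.TRUE) == null;
    }

    private static String sha1Hex(byte[] payload) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-1").digest(payload);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RecentMessageDeduplicator dedup = new RecentMessageDeduplicator(1000000);
        byte[] msg = "example event".getBytes(StandardCharsets.UTF_8);
        System.out.println(dedup.firstSighting(msg)); // true, first time
        System.out.println(dedup.firstSighting(msg)); // false, duplicate
    }
}

An application-level key is usually preferable to a payload hash, since two
legitimately identical payloads would otherwise be collapsed into one.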