Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Kafka, mail # user - Re: duplicat and lost message in Kafka


Copy link to this message
-
Re: duplicat and lost message in Kafka
Scott Carey 2013-04-22, 21:31
On 4/22/13 12:03 PM, "Yu, Libo " <[EMAIL PROTECTED]> wrote:

>Hi,
>I have read both the Kafka design document at
>http://kafka.apache.org/design.html
>and the paper. And I have two questions about lost and duplicate message.
>
>"Without acknowledging the producer, there is no guarantee that every
>published
>message is actually received by the broker." Is this still true as of
>now? Is there a
>solution to this issue in release 0.8?

0.8 has some improvements here that others may have more to speak to.

>
>"However, in the case when a consumer process crashes without a clean
>shutdown,
>the consumer process that takes over those partitions owned by the failed
>consumer
>may get some duplicate messages that are after the last offset
>successfully committed
>to zookeeper."  Does this issue have to be fixed on the consumer side as
>well? Why
>not fixed in Kafka?

I think this last one is acceptable and cannot be fixed.  The 'duplicate'
messages here
are expected and should be ok.

A consumer is expected to have one of three qualities:
* Consumption of data does not need to be reliable
* Consumption of data is idempotent (so duplicates do not matter)
* Consumption of data is transactional (so the first set of duplicates are
not committed until ZK has committed its offset)

The situation described above is when a consumer is processing data, and
has an unsafe shutdown (or network outage, etc).  If the consumer
successfully consumed batch A, to offset 123, it commits to ZooKeeper that
it has successfully reached offset 123.  Then, while processing batch B at
offset 200 it crashes before it commits that it has completed any data
after offset 123.

A new consumer starts back up at the offset following 123, the last
'checkpoint'.

The duplicate data must be handled by the consumer.

Many consumers are idempotent, others do not need 100% reliability.
The remainder must be transactional in order to avoid duplication or loss
of data.
To be transactional, one may choose a two-phase-commit approach: The
consumer stages the batch before committing to ZK, then completes the
commit after the ZK offset has been updated; during startup it must
identify whether a prior consumer had an unsafe shutdown with an
outstanding staged batch and clean up appropriately.
Alternatively a two-phase commit can commit first, then change the ZK
offset but in that case it must, as part of the commit, track the offset
coordinates so that if there is a crash prior to updating ZK, it can
recover at startup and skip the duplicate records.
>
>Thanks.
>
>Regards,
>
>Libo Yu
>