On 4/22/13 12:03 PM, "Yu, Libo " <[EMAIL PROTECTED]> wrote:
0.8 has some improvements here that others may have more to speak to.
I think this last one is acceptable and cannot be fixed. The 'duplicate'
are expected and should be ok.
A consumer is expected to have one of three qualities:
* Consumption of data does not need to be reliable
* Consumption of data is idempotent (so duplicates do not matter)
* Consumption of data is transactional (so the first set of duplicates are
not committed until ZK has committed its offset)
The situation described above is when a consumer is processing data, and
has an unsafe shutdown (or network outage, etc). If the consumer
successfully consumed batch A, to offset 123, it commits to ZooKeeper that
it has successfully reached offset 123. Then, while processing batch B at
offset 200 it crashes before it commits that it has completed any data
after offset 123.
A new consumer starts back up at the offset following 123, the last
The duplicate data must be handled by the consumer.
Many consumers are idempotent, others do not need 100% reliability.
The remainder must be transactional in order to avoid duplication or loss
To be transactional, one may choose a two-phase-commit approach: The
consumer stages the batch before committing to ZK, then completes the
commit after the ZK offset has been updated; during startup it must
identify whether a prior consumer had an unsafe shutdown with an
outstanding staged batch and clean up appropriately.
Alternatively a two-phase commit can commit first, then change the ZK
offset but in that case it must, as part of the commit, track the offset
coordinates so that if there is a crash prior to updating ZK, it can
recover at startup and skip the duplicate records.