Interesting. .. wouldn't the producer sequence grow without bounds, in the
first case, even with the simpler non-ha of key assumption, to provide a
strict exactly once semantics?
In other words, wouldn't you need to store the entire set of keys that the
broker has ever seen to ensure that a potential replayed message doesn't
make it into the commit; given multiple producers?
In mps (github.com/milindparikh/mps), I use a rotating double bloom filter
to provide a "nearly exactly once" semantics to prevent an without-bound
growth of such a sequence.
On Aug 7, 2013 4:26 PM, "Jay Kreps" <[EMAIL PROTECTED]> wrote:
> Yeah I'm not sure how good our understanding was when we wrote that.
> Here is my take now:
> At least once delivery is not that hard but you need the ability to
> deduplicate things--basically you turn the "at least once delivery channel"
> into the "exactly once channel" by throwing away duplicates. This means
> 1. Some key assigned by the producer that allows the broker to detect a
> re-published message to make publishing idempotent. This solves the problem
> of producer retries. This key obviously has to be highly available--i.e. if
> the leader for a partition fails the follower must correctly deduplicate
> for all committed messages.
> 2. Some key that allows the consumer to detect a re-consumed message.
> The first item is actually pretty doable as we can track some producer
> sequence in the log and use it to avoid duplicate appends. We just need to
> implement it. I think this can be done in a way that is fairly low overhead
> and can be "on by default".
> We actually already provide such a key to the consumer--the offset. Making
> use of this is actually somewhat application dependent. Obviously providing
> exactly-once guarantees in the case of no failures is easy and we already
> handle that case. The harder part is if a consumer process dies to ensure
> that it restarts in a position that exactly matches the state changes that
> it has made in some destination system. If the consumer application uses
> the offset in a way that makes updates idempotent that will work, or if
> they commit their offset and data atomically that works. However in general
> the goal of a consumer is to produce some state change in another system (a
> db, hdfs, some other data system, etc) and having a general solution that
> works with all of these is hard since they have very different limitations
> and features.
> On Wed, Aug 7, 2013 at 4:00 PM, Yang <[EMAIL PROTECTED]> wrote:
> > I wonder why at-least-once guarantee is easier to maintain than
> > exactly-once (in that the latter requires 2PC while the former does not ,
> > according to
> > )
> > if u achieve at-least-once guarantee, you are able to assert between 2
> > cases "nothing" vs ">=1 delivered", which can be seen as 2 different
> > answers 0 and 1. isn't this as hard as the common Byzantine general
> > problem?
> > Thanks
> > Yang