Yeah I agree, this is a problem.
The issue is that a produce request which is either in the network buffer
or in the request processing queue on the broker may still be processed
after a disconnect. So there is a race condition between that processing
and the reconnect/retry logic. You could work around this in a hacky way
using the reconnect backoff time, but the fundamental race condition
still exists. We could easily make this more transparent by adding a mode
where a disconnect throws an error back to the client, but the client has
no way to solve it either, since it cannot tell whether the in-flight
request was applied.
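
To make the race concrete, here is a toy simulation. It is only a sketch,
not Kafka code: the Broker class, its per-connection queues, and the
connection ids are all made up for illustration. It assumes the broker
still drains requests queued on a connection after the client has dropped
it.

```python
from collections import deque


class Broker:
    """Toy broker (hypothetical): one log fed by per-connection queues."""

    def __init__(self):
        self.log = []
        self.queues = {}  # connection id -> deque of pending messages

    def enqueue(self, conn_id, msg):
        # The request has reached the broker but is not yet processed.
        self.queues.setdefault(conn_id, deque()).append(msg)

    def process(self, conn_id):
        # Drain everything pending for this connection, even if the
        # client has already given up on it.
        pending = self.queues.get(conn_id, deque())
        while pending:
            self.log.append(pending.popleft())


def simulate():
    broker = Broker()
    broker.enqueue("conn-1", 1)
    broker.process("conn-1")      # message 1 is written normally
    broker.enqueue("conn-1", 2)   # message 2 is queued, then the client
                                  # sees an IO error and disconnects
    broker.enqueue("conn-2", 2)   # client reconnects and retries 2
    broker.enqueue("conn-2", 3)   # then carries on with 3
    broker.process("conn-2")
    broker.process("conn-1")      # the stale request is handled last
    return broker.log


print(simulate())  # [1, 2, 3, 2]
```

The client did everything in order, yet the log ends with the stale copy
of message 2 after message 3. A longer reconnect backoff only makes it
more likely that the stale request is processed first; it does not remove
the window.
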
Neither Storm nor Samza nor any other framework would actually fix this
issue for you, since they are in turn dependent on Kafka's ordering (though
they might solve a lot of other problems).
As Jun mentions, we have been thinking of having a per-producer sequence
number to enforce ordering. This would allow us to make produce calls
idempotent, enforce strong ordering in the case of retries, as well as fix
a number of other corner cases. I think it would handle this issue as well.
But it's not a quick patch.
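
Roughly, the idea could look like the sketch below. This is not the actual
design (that has not been written yet); SequencedLog, the producer id, and
the return values are hypothetical names, and it assumes each producer
numbers its messages 0, 1, 2, ... per partition.

```python
class SequencedLog:
    """Hypothetical broker-side log that checks per-producer sequence numbers."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer id -> highest sequence number appended

    def append(self, producer_id, seq, msg):
        last = self.last_seq.get(producer_id, -1)
        if seq <= last:
            # Already written: a retry or a stale request. Acknowledge
            # without appending, which makes the produce call idempotent.
            return "duplicate"
        if seq != last + 1:
            # A gap means an earlier message is missing; reject so the
            # producer resends in order.
            return "out_of_order"
        self.log.append(msg)
        self.last_seq[producer_id] = seq
        return "ok"


log = SequencedLog()
log.append("producer-A", 0, "m0")
log.append("producer-A", 1, "m1")          # retry of m1 arrives first
log.append("producer-A", 2, "m2")
print(log.append("producer-A", 1, "m1"))   # stale copy of m1: duplicate
print(log.log)                             # ['m0', 'm1', 'm2']
```

With a check like this, the stale request left over from the old
connection is dropped instead of being appended after newer messages.
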
I will try to get a design proposal up by next week so we have something
concrete to discuss.
On Thu, Aug 22, 2013 at 9:32 PM, Ross Black <[EMAIL PROTECTED]> wrote:
> I am using Kafka 0.7.1, and using the low-level SyncProducer to send
> messages to a *single* partition from a *single* thread.
> The client sends messages that contain sequential numbers so it is obvious
> at the consumer when message order is shuffled.
> I have noticed that messages can be saved out-of-order by Kafka when there
> are connection problems, and am looking for possible solutions (I think I
> already know the cause).
> The client sends messages in a retry loop so that it will wait for a short
> period and then retry to send on any IO errors. In SyncProducer, any
> IOException triggers a disconnect. Next time send is called a new
> connection is established. I believe that it is this disconnect/reconnect
> cycle that can cause messages to be saved to the Kafka log in a different
> order to that of the client.
> I had previously had the same sort of issue with reconnect.interval/time,
> which was fixed by disabling those reconnect settings.
> Is there anything in 0.7 that would allow me to solve this problem? The
> only option I can see at the moment is to not perform retries.
> Does 0.8 handle this issue any differently?