Ross -- thanks.
How much code are you writing to do all this, post-Kafka? Have you
considered Storm? I believe the Trident topologies can give you
guaranteed-once semantics, so you may be interested in checking that
out, if you have the time (I have not yet played with Trident stuff
myself, but Storm in general, yes). Coupling Storm to Kafka is a very
popular thing to do. Even without Trident, just using Storm in a
simpler mode may save you from writing a ton of code.
On Thu, Aug 22, 2013 at 11:59 PM, Ross Black <[EMAIL PROTECTED]> wrote:
> Hi Phillip,
> If I can assume that all messages within a single partition are ordered the
> same as delivery order, the state management to eliminate duplicates is far
> simpler.
> I am using Kafka as the infrastructure for a streaming map/reduce style
> solution, where throughput is critical.
> Events are sent into topic A, which is partitioned based on event id.
> Consumers of topic A generate data that is sent to a different topic B,
> which is partitioned by a persistence key. Consumers of topic B save the
> data to a partitioned store. Each stage can be single-threaded by the
> partition, which results in zero contention on the partitioned data store
> and massively improves the throughput.
> Message offsets are used end-to-end to eliminate duplicates, so the
> application effectively achieves guaranteed once-only processing of
> messages. Currently, any out-of-order messages result in data being
> dropped because duplicate tracking is based *only* on message offsets. If
> ordering within a partition is not guaranteed, I would need to
> maintain a list of message offsets that have been processed, rather than
> having to track just the latest message offset for a partition (and would
> need to persist this list of offsets to allow resume after failure).
> The assumption of guaranteed order is essential for the throughput the
> application achieves.
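
To make sure I follow the trade-off described above, here is a rough sketch in Python, purely for illustration. The class and method names are made up and are not part of Kafka; "offset" here just means the comparable integer carried with each message. The first class keeps one number per partition and is only correct if offsets arrive in increasing order; the second tolerates reordering but has to remember (and persist) every offset it has seen.

```python
from typing import Dict, Set


class HighWaterMarkDeduper:
    """Tracks only the latest processed offset per partition (hypothetical helper).

    Correct only if offsets within a partition arrive in increasing order.
    """

    def __init__(self) -> None:
        self._latest: Dict[int, int] = {}

    def should_process(self, partition: int, offset: int) -> bool:
        # Anything at or below the high-water mark is treated as a duplicate,
        # so a late (out-of-order) message is silently dropped.
        if offset <= self._latest.get(partition, -1):
            return False
        self._latest[partition] = offset
        return True


class OffsetSetDeduper:
    """Tracks every processed offset per partition (hypothetical helper).

    Tolerates reordering, at the cost of state that grows with each message.
    """

    def __init__(self) -> None:
        self._seen: Dict[int, Set[int]] = {}

    def should_process(self, partition: int, offset: int) -> bool:
        seen = self._seen.setdefault(partition, set())
        if offset in seen:
            return False
        seen.add(offset)
        return True
```

Feeding both the offsets 10, 30, 20, 30 for one partition, the first accepts 10 and 30 and drops 20 as if it were a duplicate, while the second accepts 10, 30 and 20 and only rejects the repeated 30.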
> On 23 August 2013 14:36, Philip O'Toole <[EMAIL PROTECTED]> wrote:
>> I am curious. What is it about your design that requires you track order
>> so tightly? Maybe there is another way to meet your needs instead of
>> relying on Kafka to do it.
>> On Aug 22, 2013, at 9:32 PM, Ross Black <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > I am using Kafka 0.7.1, and using the low-level SyncProducer to send
>> > messages to a *single* partition from a *single* thread.
>> > The client sends messages that contain sequential numbers so it is
>> > obvious at the consumer when message order is shuffled.
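
For anyone wanting to reproduce this, a consumer-side check along these lines is probably enough. It is a minimal Python sketch, not Kafka client code; the function name is hypothetical and it assumes the sequence number has already been parsed out of each message payload.

```python
from typing import Iterable, List, Tuple


def find_out_of_order(sequence_numbers: Iterable[int]) -> List[Tuple[int, int]]:
    """Return (highest_seen, received) pairs for each number that arrived late."""
    out_of_order: List[Tuple[int, int]] = []
    highest = None
    for n in sequence_numbers:
        if highest is not None and n < highest:
            # A strictly lower number after a higher one means the order was shuffled.
            out_of_order.append((highest, n))
        else:
            highest = n
    return out_of_order
```

Given the numbers 1, 2, 4, 3, 5 it reports a single late arrival: 3 received after 4.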
>> > I have noticed that messages can be saved out-of-order by Kafka when there
>> > are connection problems, and am looking for possible solutions (I think I
>> > already know the cause).
>> > The client sends messages in a retry loop so that it will wait for a
>> > period and then retry to send on any IO errors. In SyncProducer, any
>> > IOException triggers a disconnect. Next time send is called a new
>> > connection is established. I believe that it is this disconnect/reconnect
>> > cycle that can cause messages to be saved to the Kafka log in a different
>> > order to that of the client.
>> > I previously had the same sort of issue with reconnect.interval/time,
>> > which was fixed by disabling those reconnect settings.
>> > Is there anything in 0.7 that would allow me to solve this problem? The
>> > only option I can see at the moment is to not perform retries.
>> > Does 0.8 handle this issue any differently?
>> > Thanks,
>> > Ross