I can't speak for all users, but at LinkedIn we don't do this. We just run
Kafka as a high-availability system (i.e. something not allowed to be
down). These kind of systems require more care, but we already have a
number of such data systems. We chose this approach because local queuing
leads to disk/data management problems on all producers (and we have
thousands) and also late data. Late data makes aggregation very hard since
there will always be more data coming so the aggregate ends up not matching
the base data. This has lead us to a path of working on reliability of the
service itself rather than a store-and-forward model. Likewise the model
itself doesn't necessarily work--as you get to thousands of producers, then
some of those will likely go hard down if the cluster has non-trivial
periods of non-availability, and the data you queued locally is gone since
you have no fault-tolerance for that.
So that was our rationale, but you could easily go the other way. There is
nothing in kafka that prevents producer-side queueing. I could imagine two
1. Many people who want this are basically doing log aggregation. If this
is the case the collector process on the machine would just pause its
collecting if the cluster is unavailable.
2. Alternately it would be possible to embed the kafka log (which is a
standalone system) in the producer and use it for journalling in the case
of errors. Then there could be a background thread that tries to push these
stored messages out.
3. One could just catch any exceptions the producer throws and implement
(2) external to the Kafka client.
On Tue, Jan 15, 2013 at 11:29 AM, Stan Rosenberg
> In out current data ingestion system, producers are resilient in the sense
> that if data cannot be reliably published (e.g., network is down), it is
> spilled onto local disk.
> A separate process runs asynchronously and attempts to publish spilled
> data. I am curious to hear what other people do in this case.
> Is there a plan to have something similar integrated into kafka? (AFAIK,
> current implementation gives up after a configurable number of retries.)