In our current data ingestion system, producers are resilient in the sense that if data cannot be reliably published (e.g., network is down), it is spilled onto local disk. A separate process runs asynchronously and attempts to publish spilled data. I am curious to hear what other people do in this case. Is there a plan to have something similar integrated into Kafka? (AFAIK, the current implementation gives up after a configurable number of retries.)
I can't speak for all users, but at LinkedIn we don't do this. We just run Kafka as a high-availability system (i.e. something not allowed to be down). These kinds of systems require more care, but we already have a number of such data systems. We chose this approach because local queuing leads to disk/data management problems on all producers (and we have thousands) and also to late data. Late data makes aggregation very hard: there will always be more data coming, so the aggregate ends up not matching the base data. This has led us down a path of working on the reliability of the service itself rather than a store-and-forward model. Likewise the model itself doesn't necessarily work--as you get to thousands of producers, some of those will likely go hard down if the cluster has non-trivial periods of non-availability, and the data you queued locally is gone since you have no fault-tolerance for that.
So that was our rationale, but you could easily go the other way. There is nothing in Kafka that prevents producer-side queueing. I could imagine a few possible implementations:
1. Many people who want this are basically doing log aggregation. If this is the case, the collector process on the machine would just pause its collecting if the cluster is unavailable.
2. Alternately, it would be possible to embed the Kafka log (which is a standalone system) in the producer and use it for journalling in the case of errors. Then there could be a background thread that tries to push these stored messages out.
3. One could just catch any exceptions the producer throws and implement (2) external to the Kafka client.
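Option 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: `send` is a hypothetical callable standing in for whatever producer call you use (it is not a Kafka client API), the journal is a plain JSON-lines file rather than the embedded Kafka log from option 2, and `drain()` would be invoked periodically by a background thread or a separate process. It is not thread-safe as written, and a crash between a successful send and the journal rewrite would redeliver messages, so delivery is at-least-once at best.

```python
import json
import os
from typing import Callable


class SpillingPublisher:
    """Try to publish; on failure, append the message to a local journal file."""

    def __init__(self, send: Callable[[str], None], journal_path: str) -> None:
        # `send` is a hypothetical stand-in for the real producer call.
        self.send = send
        self.journal_path = journal_path

    def publish(self, message: str) -> bool:
        """Return True if sent directly, False if spilled to local disk."""
        try:
            self.send(message)
            return True
        except Exception:
            # Any producer error: journal the message instead of dropping it.
            with open(self.journal_path, "a", encoding="utf-8") as journal:
                journal.write(json.dumps(message) + "\n")
                journal.flush()
                os.fsync(journal.fileno())
            return False

    def drain(self) -> int:
        """Retry spilled messages in order; return how many were sent."""
        if not os.path.exists(self.journal_path):
            return 0
        with open(self.journal_path, encoding="utf-8") as journal:
            pending = [json.loads(line) for line in journal if line.strip()]
        sent = 0
        try:
            for message in pending:
                self.send(message)
                sent += 1
        except Exception:
            pass  # still unavailable; keep the remainder for the next attempt
        # Rewrite the journal so it holds only the messages not yet sent.
        tmp_path = self.journal_path + ".tmp"
        with open(tmp_path, "w", encoding="utf-8") as tmp:
            for message in pending[sent:]:
                tmp.write(json.dumps(message) + "\n")
        os.replace(tmp_path, self.journal_path)
        return sent
```

Note that anything recovered by `drain()` arrives after messages published once the cluster came back, which is exactly the late-data problem described above.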
-Jay
On Tue, Jan 15, 2013 at 11:29 AM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
On Tue, Jan 15, 2013 at 3:18 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
Yep, we're facing the same problem with respect to late data. I'd like to see alternative solutions to this problem, but I am afraid it's an undecidable problem in general.
> Likewise the model itself doesn't necessarily work--as you get to thousands
Right. So, you're essentially trading late data for potentially lost data? Option 2 sounds promising.
On Tue, Jan 15, 2013 at 3:12 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
There is nothing to post at the moment as we're currently in the requirements gathering phase :) Potentially, we might have a contrib project along the lines of options (2) and (3) as per Jay's email.
I liked the way Facebook's Scribe has you log to your own machine, with a separate process that forwards messages to the central logger.
With Kafka it seems that I have to embed the publisher in my app and handle any communication problems on the producer side.
I googled quite a bit trying to find a project that would basically use a daemon that parses a log file and sends the lines to the Kafka cluster (something like a tail file.log, but sending the output to Kafka instead of the console).
Does anyone know of something like that? Thanks! Fernando.
Fernando O. 2015-01-28, 18:42
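The "tail a log file and forward each line" idea can be sketched in a few lines. This is a minimal illustration, not an existing project: `follow` and `forward` are hypothetical names, `send` stands in for whatever call publishes one message to Kafka, and the `from_start` and `max_idle_polls` parameters exist only to make the example easy to try out. It does not handle log rotation or remember its position across restarts, both of which a real daemon would need.

```python
import os
import time
from typing import Callable, Iterable, Iterator, Optional


def follow(path: str, poll_seconds: float = 0.5, from_start: bool = False,
           max_idle_polls: Optional[int] = None) -> Iterator[str]:
    """Yield complete lines appended to `path`, similar to `tail -f`.

    If max_idle_polls is set, stop after that many consecutive empty polls;
    otherwise keep following the file indefinitely.
    """
    idle = 0
    buffer = ""
    with open(path, encoding="utf-8") as log:
        if not from_start:
            log.seek(0, os.SEEK_END)  # only pick up lines written from now on
        while max_idle_polls is None or idle < max_idle_polls:
            chunk = log.readline()
            if not chunk:
                idle += 1
                time.sleep(poll_seconds)
                continue
            idle = 0
            buffer += chunk
            # Hold back a partially written line until its newline arrives.
            if buffer.endswith("\n"):
                yield buffer.rstrip("\n")
                buffer = ""


def forward(lines: Iterable[str], send: Callable[[str], None]) -> int:
    """Pass each line to `send` (hypothetical publish call); return the count."""
    count = 0
    for line in lines:
        send(line)
        count += 1
    return count
```

Running `forward(follow("file.log"), send)` in its own process keeps the publisher out of the application, at the cost of the local-queue and late-data trade-offs discussed earlier in the thread.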