|
|
Stan Rosenberg 2013-01-15, 19:29
Hi,
In out current data ingestion system, producers are resilient in the sense that if data cannot be reliably published (e.g., network is down), it is spilled onto local disk. A separate process runs asynchronously and attempts to publish spilled data. I am curious to hear what other people do in this case. Is there a plan to have something similar integrated into kafka? (AFAIK, current implementation gives up after a configurable number of retries.)
Thanks,
stan
+
Stan Rosenberg 2013-01-15, 19:29
Corbin Hoenes 2013-01-15, 20:13
+1 how about posting yours to GitHub? Sounds like a good contrib project.
Sent from my iPhone
On Jan 15, 2013, at 12:29 PM, Stan Rosenberg <[EMAIL PROTECTED]> wrote:
> Hi, > > In out current data ingestion system, producers are resilient in the sense > that if data cannot be reliably published (e.g., network is down), it is > spilled onto local disk. > A separate process runs asynchronously and attempts to publish spilled > data. I am curious to hear what other people do in this case. > Is there a plan to have something similar integrated into kafka? (AFAIK, > current implementation gives up after a configurable number of retries.) > > Thanks, > > stan
+
Corbin Hoenes 2013-01-15, 20:13
Stan Rosenberg 2013-01-15, 22:32
On Tue, Jan 15, 2013 at 3:12 PM, Corbin Hoenes <[EMAIL PROTECTED]> wrote:
> +1 how about posting yours to GitHub? > Sounds like a good contrib project. >
There is nothing to post at the moment as we're currently in the requirements gathering phase :) Potentially, we might have a contrib project along the lines of options (2), (3) as per Jay's email.
+
Stan Rosenberg 2013-01-15, 22:32
Jay Kreps 2013-01-15, 20:19
I can't speak for all users, but at LinkedIn we don't do this. We just run Kafka as a high-availability system (i.e. something not allowed to be down). These kind of systems require more care, but we already have a number of such data systems. We chose this approach because local queuing leads to disk/data management problems on all producers (and we have thousands) and also late data. Late data makes aggregation very hard since there will always be more data coming so the aggregate ends up not matching the base data. This has lead us to a path of working on reliability of the service itself rather than a store-and-forward model. Likewise the model itself doesn't necessarily work--as you get to thousands of producers, then some of those will likely go hard down if the cluster has non-trivial periods of non-availability, and the data you queued locally is gone since you have no fault-tolerance for that.
So that was our rationale, but you could easily go the other way. There is nothing in kafka that prevents producer-side queueing. I could imagine two possible implementations: 1. Many people who want this are basically doing log aggregation. If this is the case the collector process on the machine would just pause its collecting if the cluster is unavailable. 2. Alternately it would be possible to embed the kafka log (which is a standalone system) in the producer and use it for journalling in the case of errors. Then there could be a background thread that tries to push these stored messages out. 3. One could just catch any exceptions the producer throws and implement (2) external to the Kafka client.
-Jay On Tue, Jan 15, 2013 at 11:29 AM, Stan Rosenberg <[EMAIL PROTECTED]>wrote:
> Hi, > > In out current data ingestion system, producers are resilient in the sense > that if data cannot be reliably published (e.g., network is down), it is > spilled onto local disk. > A separate process runs asynchronously and attempts to publish spilled > data. I am curious to hear what other people do in this case. > Is there a plan to have something similar integrated into kafka? (AFAIK, > current implementation gives up after a configurable number of retries.) > > Thanks, > > stan >
+
Jay Kreps 2013-01-15, 20:19
Stan Rosenberg 2013-01-15, 22:30
Jay,
Thanks for your insight! More comments are below.
On Tue, Jan 15, 2013 at 3:18 PM, Jay Kreps <[EMAIL PROTECTED]> wrote:
> I can't speak for all users, but at LinkedIn we don't do this. We just run > Kafka as a high-availability system (i.e. something not allowed to be > down). These kind of systems require more care, but we already have a > number of such data systems. We chose this approach because local queuing > leads to disk/data management problems on all producers (and we have > thousands) and also late data. Late data makes aggregation very hard since > there will always be more data coming so the aggregate ends up not matching > the base data. >
Yep, we're facing the same problem with respect to late data. I'd like to see alternative solutions to this problem, but I am afraid it's an undecidable problem in general. > This has lead us to a path of working on reliability of the service itself > rather than a store-and-forward model. > Likewise the model itself doesn't necessarily work--as you get to thousands > of producers, then some of those will likely go hard down if the cluster > has non-trivial periods of non-availability, and the data you queued > locally is gone since you have no fault-tolerance for that. >
Right. So, you're essentially trading late data for potentially lost data?
> So that was our rationale, but you could easily go the other way. There is > nothing in kafka that prevents producer-side queueing. I could imagine two > possible implementations: > 1. Many people who want this are basically doing log aggregation. If this > is the case the collector process on the machine would just pause its > collecting if the cluster is unavailable. > 2. Alternately it would be possible to embed the kafka log (which is a > standalone system) in the producer and use it for journalling in the case > of errors. Then there could be a background thread that tries to push these > stored messages out. > 3. One could just catch any exceptions the producer throws and implement > (2) external to the Kafka client. >
Option 2 sounds promising.
+
Stan Rosenberg 2013-01-15, 22:30
|
|