In summary, it appears the primary issue is that Kafka keeps a file handle open for each log segment. Is there a way to configure this, or is one planned? It also appears that an option to deduplicate instead of delete was added recently; doesn't the file handle issue exist with that as well, since files aren't being deleted?
Is there a reason you wouldn't want to just push the data into something built for cheap, long-term storage (like Glacier, S3, or HDFS) and perhaps "replay" from that instead of from the Kafka brokers? I can't speak for Jay, Jun or Neha, but I believe the expected usage of Kafka is essentially as a buffering mechanism to take the edge off the natural ebb and flow of unpredictable internet traffic. Highly available, long-term storage of data is probably not at the top of their list of use cases when making design decisions.
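To make that concrete, here is a rough sketch of what the "archive to cheap storage" side could look like: a consumer that just spools each message into hourly files on local disk, which you would then ship to S3/HDFS/Glacier with whatever tooling you prefer. It is written against the current Java consumer API purely for illustration (the clients available for the brokers discussed in this thread have a different API), and the topic name, group id, and paths are made up:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.time.Duration;
    import java.time.Instant;
    import java.time.ZoneOffset;
    import java.time.format.DateTimeFormatter;
    import java.util.Collections;
    import java.util.Properties;

    public class HourlyArchiver {
        public static void main(String[] args) throws IOException {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "archiver");              // hypothetical group id
            props.put("auto.offset.reset", "earliest");     // start from the oldest retained data
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            DateTimeFormatter hourFmt =
                    DateTimeFormatter.ofPattern("yyyy-MM-dd-HH").withZone(ZoneOffset.UTC);

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));   // hypothetical topic
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        // One file per topic-partition per hour; these are the files you
                        // would ship to S3/HDFS/Glacier and replay from later.
                        String hour = hourFmt.format(Instant.ofEpochMilli(r.timestamp()));
                        Path out = Paths.get("/var/spool/kafka-archive",
                                r.topic() + "-" + r.partition() + "-" + hour + ".log");
                        Files.createDirectories(out.getParent());
                        Files.write(out, (r.value() + "\n").getBytes(StandardCharsets.UTF_8),
                                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                    }
                }
            }
        }
    }

Replaying then just means reading those files back out of S3/HDFS in order, and the Kafka cluster itself only has to retain a shorter buffer.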
On Thu, Feb 21, 2013 at 6:00 PM, Anthony Grimes <[EMAIL PROTECTED]> wrote:
Forever is a long time. The definition of "replay", and how you navigate across different versions of Kafka, would be key.
Example: if you are storing market data in Kafka and have a CEP engine running on top, and would like replayed "transactions" to be fed back to ensure replayability, then you would probably want to manage that through the same mechanism as it existed at that point in the past. This might mean a different Kafka broker (perhaps 0.7) with a different set of consumers, and potentially a different JVM. This, of course, gets into a rat hole.
Regards,
Milind

On Thu, Feb 21, 2013 at 4:29 PM, Eric Tschetter <[EMAIL PROTECTED]> wrote:
You can do this and it should work fine. You would have to keep adding machines to get disk capacity, of course, since your data set would only grow.
We will keep an open file descriptor per file, but I think that is okay. Just set the segment size to 1GB; then with 10TB of storage that is only 10k files, which should be fine. Adjust the OS open FD limit up a bit if needed. File descriptors don't use much memory, so this should not hurt anything.
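For reference, that boils down to something like the following in the broker config (0.8-style property name and an illustrative value; check the docs for your version), plus raising the broker process's open-file limit via ulimit -n or your distro's limits.conf:

    # server.properties (illustrative)
    log.segment.bytes=1073741824   # 1GB segments; 10TB of log / 1GB per segment ~= 10,240 open segment files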
On Thu, Feb 21, 2013 at 4:00 PM, Anthony Grimes <[EMAIL PROTECTED]> wrote:
Apologies for asking another question as a newbie without having really tried things out, but one of our main reasons for wanting to use Kafka (not the LinkedIn use case) is exactly the fact that the "buffer" is not just for buffering. We want to keep data for days to weeks and be able to add ad-hoc consumers after the fact (obviously we could do that from downstream systems in HDFS). But let's say we have N machines gathering approximate runtime statistics for real-time use in live web applications; it is easy for them to listen to the stream destined for HDFS and keep such stats. If we have to add a new machine, or one dies, etc., it makes total sense to use the same code and just have it replay the last H hours of events to get back up to speed.
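Concretely, the "replay the last H hours" step is just an offset-by-time lookup followed by normal consumption. Here is a minimal sketch using the modern Java consumer's offsetsForTimes() purely for illustration (the clients of this era expose an equivalent offset-by-time lookup through the simple consumer); the topic name and the stats-updating logic are made up:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndTimestamp;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;

    public class StatsRebuilder {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("enable.auto.commit", "false");   // we manage our own starting point
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            long hoursBack = args.length > 0 ? Long.parseLong(args[0]) : 6;   // "H"
            long targetTs = System.currentTimeMillis() - hoursBack * 60 * 60 * 1000L;

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                // Assign every partition of the (hypothetical) "events" topic explicitly,
                // since we want to pick the starting offsets ourselves.
                List<TopicPartition> partitions = new ArrayList<>();
                for (PartitionInfo pi : consumer.partitionsFor("events")) {
                    partitions.add(new TopicPartition(pi.topic(), pi.partition()));
                }
                consumer.assign(partitions);

                // Find the earliest offset at or after "H hours ago" in each partition and seek there.
                Map<TopicPartition, Long> query = new HashMap<>();
                for (TopicPartition tp : partitions) {
                    query.put(tp, targetTs);
                }
                for (Map.Entry<TopicPartition, OffsetAndTimestamp> e : consumer.offsetsForTimes(query).entrySet()) {
                    if (e.getValue() != null) {
                        consumer.seek(e.getKey(), e.getValue().offset());
                    }
                }

                // Replay from there to rebuild the in-memory stats, then keep consuming
                // live traffic through exactly the same code path.
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> r : records) {
                        // updateStats(r.value());   // hypothetical application logic
                    }
                }
            }
        }
    }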
Sorry if my comments caused this kind of concern. Keeping days to weeks of data around is normal in Kafka (it defaults to keeping 7 days' worth, but that's configurable), and replaying from that is definitely within the realm of what it does well. My comments were more about the "forever" case, and as Jay says, that should be possible too; you just have to keep adding more disks and machines to store all the data.
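(The setting in question is the time-based retention in the broker config; roughly, with 0.8-style names:)

    log.retention.hours=168   # the 7-day default; raise it to keep data around longer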
I believe the replication in 0.8 will allow for migration of data if you lose nodes and stuff too, so maybe my concerns were poorly founded.