Apologies for asking another question as a newbie without having really
tried things out, but one of our main reasons for wanting to use Kafka
(not the LinkedIn use case) is precisely that the "buffer" is not just
for buffering. We want to keep data for days to weeks and be able to add
ad-hoc consumers after the fact (obviously we could do that from the
downstream systems in HDFS). For example, say we have N machines
gathering approximate runtime statistics for real-time use in live web
applications; it is easy for them to listen to the same stream destined
for HDFS and keep such stats. If we have to add a new machine, or one
dies, etc., it makes total sense to use the same code and simply have it
replay the last H hours of events to get back up to speed.
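For what it's worth, the "replay on (re)start" pattern described above can be sketched as a consumer configuration; the group name and values here are illustrative, not something from this thread:

```properties
# Hypothetical consumer config sketch for a stats-rebuilding consumer.
# A fresh consumer (new machine, or a replacement for a dead one) with no
# committed offset starts from the oldest retained data and replays forward.
group.id=stats-rebuilder-node-7
auto.offset.reset=earliest   # older (pre-0.9) consumers spell this "smallest"
```

This replays everything still retained rather than exactly the last H hours; newer clients (0.10.1+) also offer offsetsForTimes to seek each partition to the offset closest to a given timestamp, which fits the "last H hours" case more precisely.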
Sorry if my comments caused this kind of concern. Keeping days to weeks
of data around is normal in Kafka (it defaults to retaining 7 days'
worth, but that's configurable), and replaying from that retained data
is definitely within the realm of what Kafka does well. My comments were
more about the "forever" case; as Jay says, even that should be
possible, you just have to keep adding more disks and machines to store
all the data.
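For reference, the retention window mentioned above is controlled by broker-side settings such as these (the values shown are illustrative):

```properties
# Broker (server.properties) retention knobs.
log.retention.hours=168        # time-based retention; 168h = the 7-day default
log.retention.bytes=1073741824 # optional size cap per partition; whichever
                               # limit is hit first triggers deletion
```

Raising the time limit (or removing the size cap) is how you'd stretch retention toward weeks, at the cost of the extra disk mentioned above.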
I also believe the replication coming in 0.8 will allow data to be
migrated if you lose nodes and so on, so perhaps my concerns were poorly
founded.