Just to add to Mike's response:
When used with durable channels (mainly the file channel) and with transports
that can be rolled back (Avro), message delivery is
guaranteed (eventually). The only way you can lose data is for a part of
the chain to be permanently removed: HD failure or removal of the
Prevention of data duplication has never been an objective of Flume,
though duplication is uncommon in a properly configured setup. The larger
your batch sizes are, the more duplication you may get with each partial
failure, since a failed transaction causes the whole batch to be retried.
Similarly, ordered arrival of data is not guaranteed. The best way to
address these two issues, if they are a concern, is to run a MapReduce job
or similar to reduce to unique rows and/or reorder.
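
As a rough illustration of that post-processing step, here is a minimal single-process sketch in Python. It is not Flume code and not a real MapReduce job; it only shows the idea. It assumes each event carries a unique ID and a timestamp assigned at the source. The `Event` type and the `event_id` and `timestamp` fields are hypothetical names, and Flume does not add them unless you configure something to do so.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    # Hypothetical record layout: both fields are assumed to be
    # attached to each event at the source, before it enters Flume.
    event_id: str
    timestamp: int
    body: str


def dedupe_and_order(events: Iterable[Event]) -> List[Event]:
    """Keep the first copy of each event_id, then sort by timestamp."""
    unique = {}
    for ev in events:
        # setdefault ignores later copies that share an event_id,
        # which is what a redelivered batch would produce.
        unique.setdefault(ev.event_id, ev)
    # Tie-break on event_id so the output order is deterministic.
    return sorted(unique.values(), key=lambda ev: (ev.timestamp, ev.event_id))


# Example: event "a" was delivered twice and "c" arrived before "b".
delivered = [
    Event("a", 1, "first"),
    Event("c", 3, "third"),
    Event("a", 1, "first"),
    Event("b", 2, "second"),
]
cleaned = dedupe_and_order(delivered)
```

In a real MapReduce job the same idea would use the event ID as the shuffle key, so that all copies of an event reach the same reducer, which then emits only one.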
On 01/24/2013 12:26 PM, Henry Ma wrote:
> Dear Flume developers and users,
> I understand that Flume NG uses channel-based transactions to
> guarantee reliable message delivery between agents. But in
> some extreme failure scenarios, will Flume remain fully reliable? I have
> thought of the scenarios below.
> 1. In transactions between agents, what will happen if the receiving
> agent process goes down just after it commits its put transaction and
> before it sends the success indication to the sending agent? Will the
> sending agent send the same event again when the receiving agent
> recovers, and cause data duplication?
> 2. In the communication between the client (data source, sending data
> to the first-hop agent) and the first-hop agent, what will happen if
> the agent process goes down just after it receives the event and before
> it saves the event to its channel? Will it cause data loss?
> 3. In the communication between the final-hop agent and the storage
> system (such as MySQL, HDFS, file system, etc.), what happens if the
> agent goes down before it commits the saving transaction but has already
> saved some data in the storage? Will this cause data duplication after
> the agent recovers?
> Thank you very much!
> Best Regards,
> Henry Ma