Based on observations of our production Flume setup: per day, the file roll sink delivers roughly 0.1% more events than the HDFS sink. (We have a replicating setup, with a separate file channel for each sink.)
Flume topology: 30 first-tier machines and 3 second-tier machines (which deliver to both HDFS and the local file system)
HDFS compression codec: lzop
Channels: file channel for every source-sink pair
Hadoop version: 1.0.3 (Apache Hadoop)
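For reference, the second-tier agents look roughly like the sketch below. This is a simplified illustration of the replicating setup described above, not our actual config; agent/component names, ports, and paths are placeholders.

```properties
# Hypothetical second-tier agent: one replicating source fanning out
# to two file channels, one feeding HDFS and one feeding file_roll.
agent2.sources = avroSrc
agent2.channels = hdfsChan rollChan
agent2.sinks = hdfsSink rollSink

# Replicating selector: every event is copied to both channels
agent2.sources.avroSrc.type = avro
agent2.sources.avroSrc.bind = 0.0.0.0
agent2.sources.avroSrc.port = 4141
agent2.sources.avroSrc.selector.type = replicating
agent2.sources.avroSrc.channels = hdfsChan rollChan

# Separate file channel per sink
agent2.channels.hdfsChan.type = file
agent2.channels.hdfsChan.checkpointDir = /flume/ckpt/hdfs
agent2.channels.hdfsChan.dataDirs = /flume/data/hdfs

agent2.channels.rollChan.type = file
agent2.channels.rollChan.checkpointDir = /flume/ckpt/roll
agent2.channels.rollChan.dataDirs = /flume/data/roll

# HDFS sink with lzop compression
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.channel = hdfsChan
agent2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent2.sinks.hdfsSink.hdfs.codeC = lzop
agent2.sinks.hdfsSink.hdfs.fileType = CompressedStream

# Local file roll sink
agent2.sinks.rollSink.type = file_roll
agent2.sinks.rollSink.channel = rollChan
agent2.sinks.rollSink.sink.directory = /flume/out
```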
Things are otherwise working fine, but we see some data loss on the HDFS side (not huge: about 1 million in 1 billion events, i.e. ~0.1%).
Is such loss possible in some scenario? (Just to add: the datanodes of the Hadoop cluster are highly loaded. Can that lead to data loss?)