|
|
-
Efficient way of handling of duplicates/Deduplication strategyJagadish Bihani 2012-12-24, 07:49
Hi
I was thinking of 2 possible approaches for this : Approach 1. Deduplication at destination- Using spooling dir source -file channel - hdfs sink combination: ==============================================================-- After the HDFS sink has written to the HDFS directory. We can run a map-reduce job which based on certain unique key deduplicates the events and dumps into another HDFS directory. But problem with this approach is that dedup has to depend on the duration of the failure. e.g. Events batch A is written to HDFS. But source agent gets killed before complete file is read and then renamed. And the agent restarts after a long time (say 8 hours) then batch A will be duplicated after 8 hours. But destination deduplication has no way of knowing it. It runs say hourly. Then it will happily move initial received batch A to main HDFS directory and final result will have duplicates. Can we somehow use Flume headers or any other approach to solve the above problem scenario ? Approach 2. Using external RPC client - avro source -file channel - HDFS sink ==============================================-- External RPC client use SpoolingFileReader API and also maintains offset for each file after every batch is successfully written. So whenever an agent fails and restarts file's offset is read from the directory and file is read from that offset and not from beginning. Thus this approach has probability of duplication of at most "1 batch" per failure. (There can be scenario when multiple failures can lead to duplicating more than 1 batch at a time. But presently I am assuming it as negligible.) Does this sound like a good approach? Regards, Jagadish |