-Efficient way of handling of duplicates/Deduplication strategy
Jagadish Bihani 2012-12-24, 07:49
I was thinking of 2 possible approaches for this :
Approach 1. Deduplication at destination- Using spooling dir source
-file channel - hdfs sink combination:
==============================================================-- After the HDFS sink has written to the HDFS directory. We can run a
which based on certain unique key deduplicates the events and dumps into
directory. But problem with this approach is that dedup has to depend on
the duration of the failure.
e.g. Events batch A is written to HDFS. But source agent gets killed
before complete file is read and then
renamed. And the agent restarts after a long time (say 8 hours) then
batch A will be duplicated
after 8 hours. But destination deduplication has no way of knowing it.
It runs say hourly. Then it will
happily move initial received batch A to main HDFS directory and final
result will have duplicates.
Can we somehow use Flume headers or any other approach to solve the
above problem scenario ?
Approach 2. Using external RPC client - avro source -file channel - HDFS
==============================================-- External RPC client use SpoolingFileReader API and also maintains
offset for each file
after every batch is successfully written. So whenever an agent fails
file's offset is read from the directory and file is read from that
offset and not from beginning.
Thus this approach has probability of duplication of at most "1 batch"
(There can be scenario when multiple failures can lead to duplicating
more than 1 batch at a time.
But presently I am assuming it as negligible.)
Does this sound like a good approach?