Efficient way of handling duplicates / deduplication strategy
Hi

I was thinking of two possible approaches for handling duplicates:

Approach 1: Deduplication at the destination, using a spooling dir source -
file channel - HDFS sink combination
==============================================================
After the HDFS sink has written to the HDFS directory, we can run a
MapReduce job which deduplicates the events based on some unique key and
dumps them into another HDFS directory. The problem with this approach is
that the dedup job has to account for how long a failure lasts. For
example: event batch A is written to HDFS, but the source agent gets
killed before the complete file is read and renamed, and the agent only
restarts after a long time (say 8 hours). Batch A is then duplicated
8 hours later, and the destination-side deduplication has no way of
knowing that. If it runs, say, hourly, it will happily move the initially
received batch A to the main HDFS directory, and the final result will
still contain duplicates.
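
For concreteness, the dedup job I have in mind is roughly the following
(untested sketch; the assumption that the unique key is the first
tab-separated field of each line, and the class and path names, are just
examples):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DedupJob {

  // Map: key = the event's unique id (first tab-separated field),
  //      value = the full event line.
  public static class DedupMapper extends Mapper<Object, Text, Text, Text> {
    private final Text outKey = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String line = value.toString();
      int tab = line.indexOf('\t');
      outKey.set(tab >= 0 ? line.substring(0, tab) : line);
      ctx.write(outKey, value);
    }
  }

  // Reduce: keep only the first event seen for each unique id.
  public static class DedupReducer
      extends Reducer<Text, Text, NullWritable, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(NullWritable.get(), values.iterator().next());
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "flume-dedup");
    job.setJarByClass(DedupJob.class);
    job.setMapperClass(DedupMapper.class);
    job.setReducerClass(DedupReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // dir written by the HDFS sink
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // deduplicated output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

It would be run over the sink's output directory, e.g.
hadoop jar dedup.jar DedupJob <sink output dir> <deduplicated dir>.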

Can we somehow use Flume headers, or some other approach, to solve the
above problem scenario?
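
To make that concrete, by "use Flume headers" I mean something along
these lines: stamp every event with a deterministic id (here a hash of
the body, so a replayed duplicate gets the same id) via a custom
interceptor, and let the dedup job group on that header instead of
guessing a key from the data. This is only an untested sketch, and the
header name is just an example:

import java.security.MessageDigest;
import java.util.List;

import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;

// Stamps each event with a deterministic "eventId" header (SHA-1 of the
// body), so a replayed duplicate of the same event carries the same id.
public class BodyHashInterceptor implements Interceptor {

  @Override
  public void initialize() { }

  @Override
  public Event intercept(Event event) {
    try {
      byte[] digest = MessageDigest.getInstance("SHA-1").digest(event.getBody());
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));
      }
      event.getHeaders().put("eventId", hex.toString());
    } catch (Exception e) {
      // if hashing fails, pass the event through untagged
    }
    return event;
  }

  @Override
  public List<Event> intercept(List<Event> events) {
    for (Event e : events) {
      intercept(e);
    }
    return events;
  }

  @Override
  public void close() { }

  public static class Builder implements Interceptor.Builder {
    @Override
    public Interceptor build() {
      return new BodyHashInterceptor();
    }

    @Override
    public void configure(Context context) { }
  }
}

As far as I understand, the HDFS sink would also need a serializer that
preserves headers (the default text serializer writes only the body),
otherwise the MapReduce job never sees the id.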

Approach 2: Using an external RPC client - Avro source - file channel -
HDFS sink
==============================================
The external RPC client uses the SpoolingFileReader API and also maintains
an offset for each file, updated after every batch is successfully
written. So whenever an agent fails and restarts, the file's offset is
read back from the directory and the file is read from that offset rather
than from the beginning. Thus this approach duplicates at most one batch
per failure. (There are scenarios where multiple failures could lead to
duplicating more than one batch at a time, but for now I am treating that
as negligible.)
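
Roughly what I have in mind for the client is below (untested sketch; it
skips the SpoolingFileReader part and just reads the file directly, and
the host/port and offset-file layout are only examples):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

// Reads a file line by line starting at the last committed offset, sends
// each batch to the Avro source, and only then persists the new offset.
// On restart it resumes from the offset file, so at most the last
// in-flight batch can be re-sent.
public class OffsetTrackingClient {

  private static final int BATCH_SIZE = 100;

  public static void main(String[] args) throws Exception {
    Path dataFile = Paths.get(args[0]);
    Path offsetFile = Paths.get(args[0] + ".offset");

    long offset = Files.exists(offsetFile)
        ? Long.parseLong(new String(Files.readAllBytes(offsetFile),
            StandardCharsets.UTF_8).trim())
        : 0L;

    // hypothetical agent host/port
    RpcClient client = RpcClientFactory.getDefaultInstance("agent-host", 41414);

    try (RandomAccessFile raf = new RandomAccessFile(dataFile.toFile(), "r")) {
      raf.seek(offset);
      List<Event> batch = new ArrayList<Event>(BATCH_SIZE);
      String line;
      while ((line = raf.readLine()) != null) {
        batch.add(EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8)));
        if (batch.size() == BATCH_SIZE) {
          sendAndCommit(client, batch, raf.getFilePointer(), offsetFile);
          batch.clear();
        }
      }
      if (!batch.isEmpty()) {
        sendAndCommit(client, batch, raf.getFilePointer(), offsetFile);
      }
    } finally {
      client.close();
    }
  }

  private static void sendAndCommit(RpcClient client, List<Event> batch,
      long newOffset, Path offsetFile) throws EventDeliveryException, IOException {
    client.appendBatch(batch);                        // deliver the batch to the Avro source
    Files.write(offsetFile, Long.toString(newOffset)  // commit offset only after success
        .getBytes(StandardCharsets.UTF_8));
  }
}

If the process dies after appendBatch() succeeds but before the offset is
persisted, that one batch is re-sent on restart; that is the "at most 1
batch per failure" window mentioned above.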

Does this sound like a good approach?
Regards,
Jagadish