RE: Using HBase for Deduping
Are all these duplicate events expected to arrive within the same hour,
or can they be spread over multiple hours?

Viral
From: Rahul Ravindran
Sent: 2/14/2013 11:41 AM
To: [EMAIL PROTECTED]
Subject: Using HBase for Deduping
Hi,
   We have events delivered into our HDFS cluster which may be
duplicated. Each event has a UUID, and we were hoping to leverage
HBase to dedupe them. We would run a MapReduce job that performs a
lookup for each UUID in HBase, emits the event only if the UUID is
absent, and then inserts the UUID into the HBase table. (This is
simplistic; I am leaving out the details that make it more resilient
to failures.) My concern is that doing a read plus a write for every
event in MR would be slow (we expect around 1 billion events every
hour). Does anyone use HBase for a similar use case, or is there a
different approach to achieving the same end result? Any information
or comments would be great.

Thanks,
~Rahul.
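
For readers following the thread, the lookup-then-insert flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the poster's actual job: an in-memory dictionary stands in for the HBase table, and the names SeenStore, put_if_absent, dedupe and required_ops_per_second are hypothetical. In a real deployment each conditional write would be a network round trip to HBase (which offers an atomic check-and-put operation), and that per-event cost is exactly the concern raised in the question.

```python
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (uuid, payload); hypothetical event shape


class SeenStore:
    """In-memory stand-in for the HBase table keyed by UUID (assumption)."""

    def __init__(self) -> None:
        self._rows: Dict[str, bool] = {}

    def put_if_absent(self, uuid: str) -> bool:
        """Record the UUID; return True if it was new, False if already seen.

        Folding the lookup and the insert into one call mirrors a single
        conditional write instead of a separate read followed by a write.
        """
        if uuid in self._rows:
            return False
        self._rows[uuid] = True
        return True


def dedupe(events: Iterable[Event], store: SeenStore) -> Iterator[Event]:
    """Emit each event only the first time its UUID is seen."""
    for uuid, payload in events:
        if store.put_if_absent(uuid):
            yield (uuid, payload)


def required_ops_per_second(events_per_hour: int) -> float:
    """Sustained store operations per second needed for a given hourly volume."""
    return events_per_hour / 3600.0


events = [
    ("a1", "click"),
    ("b2", "view"),
    ("a1", "click"),  # duplicate of the first event
    ("c3", "buy"),
    ("b2", "view"),   # duplicate of the second event
]
unique = list(dedupe(events, SeenStore()))
```

At the stated volume of about 1 billion events per hour, required_ops_per_second(1_000_000_000) works out to roughly 278,000 conditional writes per second, sustained across the cluster.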