HBase >> mail # user >> RE: Using HBase for Deduping

Viral Bajaria 2013-02-14, 20:19
Rahul Ravindran 2013-02-14, 20:29
Viral Bajaria 2013-02-14, 20:48
Rahul Ravindran 2013-02-14, 21:07
Viral Bajaria 2013-02-14, 21:24
Michael Segel 2013-02-15, 01:33
Rahul Ravindran 2013-02-15, 01:35
Michael Segel 2013-02-15, 01:43
Rahul Ravindran 2013-02-14, 19:40
Re: Using HBase for Deduping
You can load the events into an HBase table that uses the event ID as the
unique row key. Set max versions to 1 on the column family, and HBase will
get rid of the duplicates for you during major compaction.
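
To make the suggestion concrete, here is a minimal sketch of the idea as a toy in-memory model. It is not the HBase client API: `VersionedTable`, `put`, `get`, `major_compact` and `stored_versions` are hypothetical names invented for illustration, and the sample events are made up. In a real cluster the equivalent setting would be something like `VERSIONS => 1` on the column family. The model only shows why blind writes keyed by event ID end up with one stored copy per ID once old versions are dropped.

```python
from collections import defaultdict


class VersionedTable:
    """Toy stand-in for one column family (hypothetical, not the HBase API)."""

    def __init__(self, max_versions=1):
        # Assumes max_versions >= 1.
        self.max_versions = max_versions
        self._cells = defaultdict(list)  # row key -> [(timestamp, value), ...]
        self._clock = 0

    def put(self, row_key, value):
        # A write never reads first; it just appends a newer version.
        self._clock += 1
        self._cells[row_key].append((self._clock, value))

    def get(self, row_key):
        # Reads see only the newest version of the cell.
        versions = self._cells.get(row_key)
        return versions[-1][1] if versions else None

    def major_compact(self):
        # Physically drop every version beyond the newest max_versions.
        for row_key in list(self._cells):
            self._cells[row_key] = self._cells[row_key][-self.max_versions:]

    def stored_versions(self):
        return sum(len(v) for v in self._cells.values())


# Made-up sample: the event "uuid-a" is delivered twice.
events = [("uuid-a", "click"), ("uuid-b", "view"), ("uuid-a", "click")]

table = VersionedTable(max_versions=1)
for event_id, payload in events:
    table.put(event_id, payload)

before = table.stored_versions()  # versions held before compaction
table.major_compact()
after = table.stored_versions()   # one version per distinct event ID
print(before, after)
```

The point of this design is that the load path issues writes only, with no lookup per event; the duplicate copies are cleaned up later rather than filtered on the way in.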

On Thursday, February 14, 2013, Rahul Ravindran wrote:

> Hi,
>    We have events which are delivered into our HDFS cluster which may be
> duplicated. Each event has a UUID and we were hoping to leverage HBase to
> dedupe them. We run a MapReduce job which would perform a lookup for each
> UUID on HBase and then emit the event only if the UUID was absent and would
> also insert into the HBase table (this is simplistic; I am leaving out
> details that make this more resilient to failures). My concern is that doing
> a read+write for every event in MR would be slow (we expect around 1
> billion events every hour). Does anyone use HBase for a similar use case, or
> is there a different approach to achieving the same end result? Any
> information or comments would be great.
> Thanks,
> ~Rahul.
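
For scale, the lookup-then-insert approach described in the quoted message can be sketched as below, together with the throughput it implies. This is a rough illustration under stated assumptions: a Python `set` stands in for the HBase table, `dedupe` and `required_ops_per_second` are hypothetical helper names, and the figure of two operations per event is an upper bound (one read for every event, plus one write when the UUID is new).

```python
EVENTS_PER_HOUR = 1_000_000_000  # figure quoted in the message
SECONDS_PER_HOUR = 3600


def required_ops_per_second(events_per_hour, ops_per_event=2):
    """Upper bound on store operations per second (read + write per event)."""
    return events_per_hour * ops_per_event / SECONDS_PER_HOUR


def dedupe(events, seen):
    """Emit an event only if its UUID is absent from `seen`, then record it."""
    for event_id, payload in events:
        if event_id in seen:   # the per-event read
            continue
        seen.add(event_id)     # the per-event write
        yield event_id, payload


events_per_second = EVENTS_PER_HOUR / SECONDS_PER_HOUR
ops_per_second = required_ops_per_second(EVENTS_PER_HOUR)
print(round(events_per_second), round(ops_per_second))

# Made-up sample with one duplicate UUID.
sample = [("u1", "a"), ("u2", "b"), ("u1", "a")]
unique = list(dedupe(sample, set()))
print(unique)
```

At 1 billion events per hour this works out to roughly 278 thousand events per second, so up to about 556 thousand reads and writes per second against the store, which is the cost the question is concerned about.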
Anoop Sam John 2013-02-15, 04:19
Rahul Ravindran 2013-02-19, 17:32