

Viral Bajaria 2013-02-14, 20:19
Re: Using HBase for Deduping
Most will be in the same hour. Some will be across 3-6 hours.

Sent from my phone. Excuse the terseness.

On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:

> Are all these dupe events expected to arrive within the same hour, or
> can they be spread over multiple hours?
>
> Viral
> From: Rahul Ravindran
> Sent: 2/14/2013 11:41 AM
> To: [EMAIL PROTECTED]
> Subject: Using HBase for Deduping
> Hi,
>    We have events delivered into our HDFS cluster which may be
> duplicated. Each event has a UUID, and we were hoping to leverage
> HBase to dedupe them. We run a MapReduce job which would perform a
> lookup for each UUID in HBase, emit the event only if the UUID was
> absent, and then insert the UUID into the HBase table. (This is
> simplistic; I am leaving out details that would make this more
> resilient to failures.) My concern is that doing a read + write for
> every event in MR would be slow (we expect around 1 billion events
> every hour). Does anyone use HBase for a similar use case, or is there
> a different approach to achieving the same end result? Any
> information or comments would be great.
>
> Thanks,
> ~Rahul.
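
For readers following the thread, here is a minimal sketch of the lookup-then-insert dedupe described in the quoted message. It is not the poster's implementation and does not use the HBase API: an in-memory dict stands in for the HBase table, and the names `SeenStore`, `put_if_absent`, `expire_before`, `dedupe`, and `WINDOW_HOURS` are hypothetical. The 6-hour window is an assumption taken from the "3-6 hours" figure in the reply above.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (uuid, arrival_hour, payload).
Event = Tuple[str, int, str]

# Assumption: duplicates arrive at most this many hours apart
# (the reply in this thread mentions 3-6 hours).
WINDOW_HOURS = 6


class SeenStore:
    """In-memory stand-in for a table keyed by UUID (not the HBase API)."""

    def __init__(self) -> None:
        self._first_seen: Dict[str, int] = {}

    def put_if_absent(self, uuid: str, hour: int) -> bool:
        """Record the UUID; return True only if it was not already present."""
        if uuid in self._first_seen:
            return False
        self._first_seen[uuid] = hour
        return True

    def expire_before(self, hour: int) -> None:
        """Drop UUIDs first seen before `hour` to bound the store's size."""
        self._first_seen = {
            u: h for u, h in self._first_seen.items() if h >= hour
        }


def dedupe(events: Iterable[Event], store: SeenStore) -> Iterator[Event]:
    """Emit each event whose UUID has not been recorded in the store."""
    for uuid, hour, payload in events:
        if store.put_if_absent(uuid, hour):
            yield (uuid, hour, payload)


if __name__ == "__main__":
    store = SeenStore()
    batch = [("a", 0, "x"), ("b", 0, "y"), ("a", 0, "x"), ("b", 4, "y")]
    for event in dedupe(batch, store):
        print(event)
    # Once hour 10 is being processed, entries older than the window can go.
    store.expire_before(10 - WINDOW_HOURS)
```

In a real MapReduce job the check and the insert would have to be a single atomic step, since two tasks may see the same UUID concurrently; the HBase client's `checkAndPut` operation is one candidate for that role. Whether one such call per event keeps up with the volume mentioned above is exactly the open question of the thread, and this sketch does not answer it.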