HBase >> mail # user >> RE: Using HBase for Deduping


Viral Bajaria 2013-02-14, 20:19
Re: Using HBase for Deduping
Most will be in the same hour. Some will be across 3-6 hours.

Sent from my phone. Excuse the terseness.

On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:

> Are all these dupe events expected to be within the same hour, or can
> they happen over multiple hours?
>
> Viral
> From: Rahul Ravindran
> Sent: 2/14/2013 11:41 AM
> To: [EMAIL PROTECTED]
> Subject: Using HBase for Deduping
> Hi,
>    We have events which are delivered into our HDFS cluster which may
> be duplicated. Each event has a UUID and we were hoping to leverage
> HBase to dedupe them. We run a MapReduce job which would perform a
> lookup for each UUID on HBase and then emit the event only if the UUID
> was absent and would also insert it into the HBase table (this is
> simplistic; I am leaving out details that make this more resilient to
> failures). My concern is that doing a Read+Write for every event in MR
> would be slow (we expect around 1 billion events every hour). Does
> anyone use HBase for a similar use case, or is there a different
> approach to achieving the same end result? Any information or comments
> would be great.
>
> Thanks,
> ~Rahul.
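
The lookup-then-insert flow described in the original message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the poster's implementation: `SeenStore`, `put_if_absent`, and `dedupe` are hypothetical names, and a plain in-memory dictionary stands in for the HBase table keyed by event UUID. The sketch folds the read and the write into one conditional-insert step, and it leaves out the failure handling the poster mentions omitting.

```python
from typing import Dict, Iterable, Iterator, Tuple

# An event is modelled as (uuid, payload); both fields are illustrative.
Event = Tuple[str, str]


class SeenStore:
    """Hypothetical in-memory stand-in for the HBase table of seen UUIDs."""

    def __init__(self) -> None:
        self._rows: Dict[str, bool] = {}

    def put_if_absent(self, row_key: str) -> bool:
        """Insert row_key and return True if it was new, else return False."""
        if row_key in self._rows:
            return False
        self._rows[row_key] = True
        return True


def dedupe(events: Iterable[Event], store: SeenStore) -> Iterator[Event]:
    """Emit each event only the first time its UUID is seen by the store."""
    for event_id, payload in events:
        if store.put_if_absent(event_id):
            yield (event_id, payload)


store = SeenStore()

# First batch: the second ("a", ...) event is a duplicate within the batch.
first_batch = [("a", "click"), ("b", "view"), ("a", "click")]
unique_first = list(dedupe(first_batch, store))

# Later batch: "b" was already recorded by the earlier run.
later_batch = [("b", "view"), ("c", "buy")]
unique_later = list(dedupe(later_batch, store))

print(unique_first)
print(unique_later)
```

At around 1 billion events per hour, this amounts to roughly 278,000 conditional inserts per second on average (1,000,000,000 / 3,600), which is the load behind the concern about doing a Read+Write per event.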
Viral Bajaria 2013-02-14, 20:48
Rahul Ravindran 2013-02-14, 21:07
Viral Bajaria 2013-02-14, 21:24
Michael Segel 2013-02-15, 01:33
Rahul Ravindran 2013-02-15, 01:35
Michael Segel 2013-02-15, 01:43
Rahul Ravindran 2013-02-14, 19:40
Asaf Mesika 2013-02-14, 22:26
Anoop Sam John 2013-02-15, 04:19
Rahul Ravindran 2013-02-19, 17:32