HBase, mail # user - RE: Using HBase for Deduping


Re: Using HBase for Deduping
Rahul Ravindran 2013-02-15, 01:35
checkAndPut() does not work when the row does not exist, or am I missing something?

Sent from my phone. Excuse the terseness.
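[Editor's note: checkAndPut() can in fact handle the missing-row case: when the expected value passed to HTable.checkAndPut() is null, the check succeeds only if the cell does not yet exist, so the put is applied exactly once per row key. A minimal sketch of those semantics, simulated here with an in-memory map rather than a live HBase table; the class and method names are illustrative, not from the thread.]

```java
import java.util.concurrent.ConcurrentHashMap;

// Simulates the semantics of HBase's checkAndPut(row, family, qualifier,
// null /* expected value */, put): with a null expected value, the put is
// applied only if the cell does not already exist, so it works even when
// the row is absent.
public class CheckAndPutDedupe {
    // The map stands in for one column of an HBase table (rowKey -> value).
    private final ConcurrentHashMap<String, String> table = new ConcurrentHashMap<>();

    // Returns true if the event was new (the "put" happened),
    // false if the UUID was already seen (a duplicate).
    public boolean putIfAbsent(String uuid, String event) {
        // putIfAbsent mirrors checkAndPut's atomic check-then-write.
        return table.putIfAbsent(uuid, event) == null;
    }

    public static void main(String[] args) {
        CheckAndPutDedupe dedupe = new CheckAndPutDedupe();
        System.out.println(dedupe.putIfAbsent("uuid-1", "event-a")); // true: first sighting
        System.out.println(dedupe.putIfAbsent("uuid-1", "event-a")); // false: duplicate
    }
}
```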

On Feb 14, 2013, at 5:33 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> What constitutes a duplicate?
>
> An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist.
> Then, if the row is inserted (a TRUE return value), you push the event.
>
> That will do what you want.
>
> At least at first blush.
>
>
>
> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:
>
>> Given the size of the data (> 1B rows) and the frequency of the job run (once
>> per hour), I don't think the optimal solution is to look up HBase for
>> every single event. You will benefit more by loading the HBase table
>> directly in your MR job.
>>
>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>
>> Also, once you have de-duped, are you going to use the data again in
>> some other way, i.e. online serving of traffic or some other analysis? Or
>> is this just to compute some unique #'s?
>>
>> It would be more helpful if you described your final use case for the computed
>> data too. Given the amount of back and forth, we can also take this off-list
>> and summarize the conversation for the list.
>>
>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
>>
>>> We can't rely on the assumption that event dupes will not dupe outside an
>>> hour boundary. So, your take is that doing a lookup per event within the
>>> MR job is going to be bad?
>>>
>>>
>>> ________________________________
>>> From: Viral Bajaria <[EMAIL PROTECTED]>
>>> To: Rahul Ravindran <[EMAIL PROTECTED]>
>>> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>> Sent: Thursday, February 14, 2013 12:48 PM
>>> Subject: Re: Using HBase for Deduping
>>>
>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>> lookups. I don't think this is the best solution either, given the # of
>>> events you will get.
>>>
>>> FWIW, the solution below again relies on the assumption that if an event is
>>> duped in the same hour it won't have a dupe outside of that hour boundary.
>>> If it can, then you are better off running an MR job with the
>>> current hour + another 3 hours of data, or an MR job with the current hour +
>>> the HBase table as input to the job too (i.e. no HBase lookups, just read
>>> the HFile directly).
>>>
>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs on
>>> 1 hour's worth of data.
>>> - Mark records which you were not able to de-dupe in the current run.
>>> - For the records that you were not able to de-dupe, check against HBase
>>> whether you saw that event in the past. If you did, you can drop the
>>> current event or update the event to the new value (based on your business
>>> logic).
>>> - Save all the de-duped events (via HBase bulk upload).
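[Editor's note: the steps above can be sketched as a two-phase flow; this is an illustration rather than anything from the thread, with the HBase table stood in by an in-memory set of historical UUIDs and the hourly batch by a plain list. The name dedupeHour is hypothetical.]

```java
import java.util.*;

// Sketch of the two-phase de-dupe above: first remove duplicates within the
// current hour's batch (the MR pass), then check the survivors against the
// store of historical UUIDs (the HBase lookup) and keep only unseen events.
public class TwoPhaseDedupe {
    // Returns the events that are new both within this hour and historically.
    public static List<String> dedupeHour(List<String> hourUuids, Set<String> history) {
        // Phase 1: LinkedHashSet drops in-hour duplicates, preserving order.
        Set<String> inHour = new LinkedHashSet<>(hourUuids);
        List<String> fresh = new ArrayList<>();
        for (String uuid : inHour) {
            if (!history.contains(uuid)) { // phase 2: "did we see it in the past?"
                fresh.add(uuid);
            }
        }
        // Phase 3: persist the newly seen UUIDs (stands in for the bulk upload).
        history.addAll(fresh);
        return fresh;
    }

    public static void main(String[] args) {
        Set<String> history = new HashSet<>(List.of("a"));
        System.out.println(dedupeHour(List.of("a", "b", "b", "c"), history)); // [b, c]
    }
}
```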
>>>
>>> Sorry if I rambled along, but without knowing the whole problem it's
>>> very tough to come up with a plausible solution. So correct my assumptions
>>> and we can drill down more.
>>>
>>> Thanks,
>>> Viral
>>>
>>> On Thu, Feb 14, 2013 at 12:29 PM, Rahul Ravindran <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> Most will be in the same hour. Some will be across 3-6 hours.
>>>>
>>>> Sent from my phone. Excuse the terseness.
>>>>
>>>> On Feb 14, 2013, at 12:19 PM, Viral Bajaria <[EMAIL PROTECTED]>
>>>> wrote:
>>>>
>>>>> Are all these dupe events expected to be within the same hour, or can
>>>>> they happen over multiple hours?
>>>>>
>>>>> Viral
>>>>> From: Rahul Ravindran
>>>>> Sent: 2/14/2013 11:41 AM
>>>>> To: [EMAIL PROTECTED]
>>>>> Subject: Using HBase for Deduping
>>>>> Hi,
>>>>>  We have events which are delivered into our HDFS cluster which may
>>>>> be duplicated. Each event has a UUID and we were hoping to leverage
>>>>> HBase to dedupe them. We run a MapReduce job which would perform a