Re: Using HBase for Deduping
Michael Segel 2013-02-15, 01:43
Maybe it's a lack of sleep, but this is what I found...
public boolean checkAndPut(byte[] row, byte[] family, byte[] qualifier, byte[] value, Put put) throws IOException
Atomically checks if a row/family/qualifier value matches the expected value. If it does, it adds the put. If the passed value is null, the check is for the lack of column (i.e., non-existence).
Specified by:
    checkAndPut in interface HTableInterface
Parameters:
    row - to check
    family - column family to check
    qualifier - column qualifier to check
    value - the expected value
    put - data to put if check succeeds
Returns:
    true if the new put was executed, false otherwise
Throws:
    IOException - e
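To make that contract concrete for the dedup case, here is a minimal sketch in plain Java. It is an in-memory stand-in, not the HBase client: CheckAndPutSketch and firstSighting are made-up names, a ConcurrentHashMap plays the role of the table, and each row is reduced to a single cell. The point is only the rule quoted above: a null expected value means "put only if the column is absent", and the boolean result tells you whether the put ran.

```java
import java.util.Objects;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// In-memory stand-in for the checkAndPut contract quoted above.
// Hypothetical class, NOT the HBase client: one "cell" per row key.
class CheckAndPutSketch {
    private final ConcurrentMap<String, String> cells = new ConcurrentHashMap<>();

    // Puts newValue only if the current value equals expected.
    // expected == null means "the cell must not exist yet".
    // Returns true if the put was executed, false otherwise.
    boolean checkAndPut(String row, String expected, String newValue) {
        boolean[] executed = {false};
        // compute() runs atomically per key, so check and put cannot interleave.
        cells.compute(row, (key, current) -> {
            if (Objects.equals(current, expected)) {
                executed[0] = true;
                return newValue;
            }
            return current; // check failed: leave the cell untouched
        });
        return executed[0];
    }

    // Dedup step: true means first sighting, so the caller pushes the event.
    boolean firstSighting(String eventId) {
        return checkAndPut(eventId, null, "seen");
    }

    public static void main(String[] args) {
        CheckAndPutSketch table = new CheckAndPutSketch();
        System.out.println(table.firstSighting("uuid-1")); // row absent, put runs
        System.out.println(table.firstSighting("uuid-1")); // row present, put skipped
    }
}
```

Against a real table the same idea would be a checkAndPut call with null passed as the value argument, assuming the Javadoc above describes what the server actually does.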
Maybe I'm reading it wrong?
But hey! What do I know? It's Valentine's Day and I'm spending my evening answering questions sitting in my man cave instead of spending it with my wife.
It's no wonder I live in the perpetual doghouse! :-P
On Feb 14, 2013, at 7:35 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
> checkAndPut() does not work when the row does not exist, or am I missing something?
> Sent from my phone. Excuse the terseness.
> On Feb 14, 2013, at 5:33 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> What constitutes a duplicate?
>> An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist.
>> Then if the row is inserted (return value TRUE), you push the event.
>> That will do what you want.
>> At least at first blush.
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:
>>> Given the size of the data (> 1B rows) and the frequency of the job run (once
>>> per hour), I don't think your best option is to look up HBase for
>>> every single event. You will benefit more by loading the HBase table
>>> directly in your MR job.
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>> Also, once you have done the unique, are you going to use the data again in
>>> some other way, i.e. online serving of traffic or some other analysis? Or
>>> is this just to compute some unique counts?
>>> It will be more helpful if you describe your final use case of the computed
>>> data too. Given the amount of back and forth, we can take it off list too
>>> and summarize the conversation for the list.
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
>>>> We can't rely on the assumption that event dupes will not occur outside an
>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>> MR job is going to be bad?
>>>> From: Viral Bajaria <[EMAIL PROTECTED]>
>>>> To: Rahul Ravindran <[EMAIL PROTECTED]>
>>>> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>> You could go with a two-pronged approach here, i.e. some MR and some HBase
>>>> lookups. I don't think this is the best solution either, given the # of
>>>> events you will get.
>>>> FWIW, the solution below again relies on the assumption that if an event is
>>>> duped in the same hour it won't have a dupe outside of that hour boundary.
>>>> If it can, then you are better off running an MR job with the
>>>> current hour + another 3 hours of data, or an MR job with the current hour +
>>>> the HBase table as input to the job too (i.e. no HBase lookups, just read
>>>> the HFile directly).
>>>> - Run an MR job which de-dupes events for the current hour, i.e. only runs on
>>>> 1 hour's worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run
>>>> - For the records that you were not able to de-dupe, check against HBase
>>>> whether you saw that event in the past. If you did, you can drop the
Michael Segel | (m) 312.755.9623
Segel and Associates