HBase >> mail # user >> Using HBase for Deduping


Re: Using HBase for Deduping
I could surround it with a try/catch, but that would mean that each time I insert a UUID for the first time (99% of the time), I would do a checkAndPut(), catch the resulting exception, and then perform a Put. That is 2 operations per reduce invocation, which is what I was looking to avoid.
________________________________
 From: Michael Segel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Rahul Ravindran <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping
 

Interesting. 

Surround with a Try Catch? 

But it sounds like you're on the right path. 

Happy Coding!
On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:

I had tried checkAndPut yesterday with null passed as the value, and it threw an exception when the row did not exist. Perhaps I was doing something wrong. I will try that again since, yes, I would prefer a checkAndPut().
>
>
>________________________________
>From: Michael Segel <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Cc: Rahul Ravindran <[EMAIL PROTECTED]>
>Sent: Friday, February 15, 2013 4:36 AM
>Subject: Re: Using HBase for Deduping
>
>
>On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
>
>
>Michael, this means read for every write?
>>
>>Yes and no.
>
>At the macro level, a read for every write would mean that your client would read a record from HBase, and then based on some logic it would either write a record, or not.
>
>So you have a lot of overhead in the initial get() followed by the put().
>
>At this macro level, with a Check and Put you have less overhead because of a single message to HBase.
>
>Internal to HBase, you would still have to check whether the value exists in the row and then perform the insert or not.
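
[Editor's sketch] The difference between the two client patterns above can be illustrated by counting client-visible calls. This is only an in-memory stand-in, not HBase code: the class `RoundTripSketch` and its methods `exists`, `put`, and `checkAndPutIfAbsent` are hypothetical names invented for illustration, and a `ConcurrentHashMap` plays the role of the table. In the real client the single-call variant would be `HTable.checkAndPut(row, family, qualifier, value, put)`.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** In-memory stand-in for a table (hypothetical); counts client-visible calls. */
final class RoundTripSketch {
    private final ConcurrentHashMap<String, String> rows = new ConcurrentHashMap<>();
    private final AtomicInteger calls = new AtomicInteger();

    // Stand-in for a get(): one call to the "server".
    boolean exists(String rowKey) {
        calls.incrementAndGet();
        return rows.containsKey(rowKey);
    }

    // Stand-in for a put(): one call to the "server".
    void put(String rowKey, String value) {
        calls.incrementAndGet();
        rows.put(rowKey, value);
    }

    // Stand-in for a check-and-put: the check and the write happen in one call.
    boolean checkAndPutIfAbsent(String rowKey, String value) {
        calls.incrementAndGet();
        return rows.putIfAbsent(rowKey, value) == null;
    }

    int calls() {
        return calls.get();
    }

    // Client-side read-then-write for a row that does not exist yet.
    static int callsForGetThenPut(String uuid) {
        RoundTripSketch table = new RoundTripSketch();
        if (!table.exists(uuid)) {
            table.put(uuid, "seen");
        }
        return table.calls();
    }

    // Single check-and-put for a row that does not exist yet.
    static int callsForCheckAndPut(String uuid) {
        RoundTripSketch table = new RoundTripSketch();
        table.checkAndPutIfAbsent(uuid, "seen");
        return table.calls();
    }

    public static void main(String[] args) {
        System.out.println("get-then-put calls: " + callsForGetThenPut("uuid-1"));
        System.out.println("checkAndPut calls:  " + callsForCheckAndPut("uuid-1"));
    }
}
```

The server-side work (check, then maybe insert) is still done in both cases; the sketch only shows that the client sends one message instead of two for a first-time key.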
>
>With respect to your billion events an hour...
>
>Dividing by 3600 gives the number of events per second. You would have fewer than 300,000 events a second.
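
[Editor's sketch] The arithmetic above, written out. The class and method names are hypothetical, and the input of one billion events per hour is taken from the thread.

```java
/** Converts an hourly event count to a per-second rate (integer division). */
final class EventRate {
    static long eventsPerSecond(long eventsPerHour) {
        return eventsPerHour / 3600L; // 3600 seconds in an hour
    }

    public static void main(String[] args) {
        // One billion events per hour, as discussed in the thread.
        System.out.println(eventsPerSecond(1_000_000_000L)); // prints 277777
    }
}
```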
>
>What exactly are you doing and how large are those events?
>
>Since you are processing these events in a batch job, timing doesn't appear to be that important and of course there is also async hbase which may improve some of the performance.
>
>YMMV, but this is a good use case for checkAndPut().
>
>
>
>
>On Friday, February 15, 2013, Michael Segel wrote:
>>
>>
>>What constitutes a duplicate?
>>>
>>>An oversimplification is to do an HTable.checkAndPut() where you do the
>>>put if the column doesn't exist.
>>>Then, if the row is inserted (return value TRUE), you push the event.
>>>
>>>That will do what you want.
>>>
>>>At least at first blush.
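
[Editor's sketch] The dedup flow described above (insert only if absent, push the event only when the insert succeeded) can be sketched in memory. This is not HBase code: `DedupSketch`, `checkAndPutIfAbsent`, and `dedupe` are hypothetical names, and a `ConcurrentHashMap` stands in for the table of seen UUIDs. With the real client, the equivalent step would presumably be `HTable.checkAndPut(row, family, qualifier, null, put)`, where a null expected value is documented as a check for the column's non-existence.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

/** In-memory stand-in for a "seen UUIDs" table (hypothetical). */
final class DedupSketch {
    private final ConcurrentHashMap<String, Boolean> seen = new ConcurrentHashMap<>();

    // Returns true only for the first caller to insert this UUID.
    boolean checkAndPutIfAbsent(String uuid) {
        return seen.putIfAbsent(uuid, Boolean.TRUE) == null;
    }

    // Push an event downstream only when its UUID was newly inserted.
    static List<String> dedupe(List<String> uuids) {
        DedupSketch table = new DedupSketch();
        List<String> pushed = new ArrayList<>();
        for (String uuid : uuids) {
            if (table.checkAndPutIfAbsent(uuid)) {
                pushed.add(uuid);
            }
        }
        return pushed;
    }

    public static void main(String[] args) {
        // Prints [a, b, c]: repeats of "a" and "b" are dropped.
        System.out.println(dedupe(Arrays.asList("a", "b", "a", "c", "b")));
    }
}
```

Because the check and the insert are a single atomic step, a duplicate simply yields false; no exception handling or second write is needed in this sketch.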
>>>
>>>
>>>
>>>On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[EMAIL PROTECTED]>
>>>wrote:
>>>
>>>
>>>>Given the size of the data (> 1B rows) and the frequency of job run (once
>>>>per hour), I don't think your most optimal solution is to lookup HBase for
>>>>every single event. You will benefit more by loading the HBase table
>>>>directly in your MR job.
>>>>
>>>>In 1B rows, what's the cardinality ? Is it 100M UUID's ? 99% unique UUID's ?
>>>>
>>>>Also once you have done the unique, are you going to use the data again in
>>>>some other way i.e. online serving of traffic or some other analysis ? Or
>>>>this is just to compute some unique #'s ?
>>>>
>>>>It will be more helpful if you describe your final use case of the computed
>>>>data too. Given the amount of back and forth, we can take it off list too
>>>>and summarize the conversation for the list.
>>>>
>>>>On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[EMAIL PROTECTED]>
>>>>wrote:
>>>
>>>
>>>>
>>>>>We can't rely on the assumption that event dupes will not dupe outside an
>>>>>hour boundary. So, your take is that doing a lookup per event within the
>>>>>MR job is going to be bad?
>>>>>
>>>>>
>>>>>________________________________
>>>>>From: Viral Bajaria <[EMAIL PROTECTED]>
>>>>>To: Rahul Ravindran <[EMAIL PROTECTED]>
>>>>>Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>>>>Sent: Thursday, February 14, 2013 12:48 PM

Michael Segel  | (m) 312.755.9623
Segel and Associates