HBase user mailing list: Using HBase for Deduping

Re: Using HBase for Deduping
I could surround it with a try/catch, but then each time I insert a UUID for the first time (99% of the time), I would do a checkAndPut(), catch the resulting exception, and perform a Put: two operations per reduce invocation, which is what I was looking to avoid.
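For reference, a minimal sketch of the single-call idiom under discussion, assuming the 0.94-era HTable client API; passing null as the expected value asks checkAndPut() to apply the Put only when the cell is absent (one RPC, no exception for a missing row). The "dedup" table and "d:seen" column are illustrative, not from the thread:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Returns true only the first time this uuid is inserted.
static boolean putIfAbsent(HTable table, String uuid) throws IOException {
    byte[] row = Bytes.toBytes(uuid);
    Put put = new Put(row);
    put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
    // null expected value = "only put if the d:seen cell does not exist yet";
    // the check and the put happen atomically in a single round trip.
    return table.checkAndPut(row, Bytes.toBytes("d"), Bytes.toBytes("seen"), null, put);
}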
________________________________
 From: Michael Segel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Rahul Ravindran <[EMAIL PROTECTED]>
Sent: Friday, February 15, 2013 9:24 AM
Subject: Re: Using HBase for Deduping
 

Interesting. 

Surround with a Try Catch? 

But it sounds like you're on the right path. 

Happy Coding!
On Feb 15, 2013, at 11:12 AM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:

>I had tried checkAndPut yesterday with a null passed as the value, and it had thrown an exception when the row did not exist. Perhaps I was doing something wrong. Will try that again since, yes, I would prefer a checkAndPut().
>
>________________________________
>From: Michael Segel <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Cc: Rahul Ravindran <[EMAIL PROTECTED]>
>Sent: Friday, February 15, 2013 4:36 AM
>Subject: Re: Using HBase for Deduping
>
>On Feb 15, 2013, at 3:07 AM, Asaf Mesika <[EMAIL PROTECTED]> wrote:
>
>>Michael, this means read for every write?
>
>Yes and no.
>
>At the macro level, a read for every write would mean that your client would read a record from HBase, and then based on some logic it would either write a record or not.
>
>So you have a lot of overhead in the initial get() and then the put().
>
>At this macro level, with a checkAndPut() you have less overhead because there is only a single message to HBase.
>
>Internal to HBase, it would still have to check the value in the row, if it exists, and then perform the insert or not.
>
>With respect to your billion events an hour...
>
>Dividing by 3600 to get the number of events in a second, you would have fewer than 300,000 events a second (1,000,000,000 / 3600 is roughly 278,000).
>
>What exactly are you doing and how large are those events?
>
>Since you are processing these events in a batch job, timing doesn't appear to be that important, and of course there is also asynchbase, which may improve some of the performance.
>
>YMMV, but this is a good example of the checkAndPut() pattern.
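A rough sketch of the two client-side patterns being compared, using the same illustrative "dedup" schema as the snippet near the top of the thread; the get()-then-put() version costs two round trips and leaves a race window between the read and the write, while checkAndPut() does the check and the write in one atomic RPC:

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Read-then-write: two RPCs, and another writer can sneak in between them.
static boolean getThenPut(HTable table, byte[] row, Put put) throws IOException {
    Get get = new Get(row);
    get.addColumn(Bytes.toBytes("d"), Bytes.toBytes("seen"));
    if (!table.exists(get)) {    // RPC #1: the read
        table.put(put);          // RPC #2: the write (not atomic with the read)
        return true;
    }
    return false;
}

// Check-and-write: one RPC, atomic on the server side.
static boolean checkThenPut(HTable table, byte[] row, Put put) throws IOException {
    return table.checkAndPut(row, Bytes.toBytes("d"), Bytes.toBytes("seen"), null, put);
}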
>
>>On Friday, February 15, 2013, Michael Segel wrote:
>>
>>>What constitutes a duplicate?
>>>
>>>An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist.
>>>Then, if the row is inserted (a TRUE return value), you push the event.
>>>
>>>That will do what you want.
>>>
>>>At least at first blush.
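A hedged sketch of how that TRUE return value might drive the push inside a reduce task, under the same illustrative schema; the table handle is assumed to be opened in setup() and closed in cleanup(), both omitted here, and the key is assumed to be the event UUID:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupReducer extends Reducer<Text, Text, Text, Text> {
    private HTable table;  // opened in setup(), closed in cleanup() (omitted)

    @Override
    protected void reduce(Text uuid, Iterable<Text> events, Context ctx)
            throws IOException, InterruptedException {
        byte[] row = Bytes.toBytes(uuid.toString());
        Put put = new Put(row);
        put.add(Bytes.toBytes("d"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
        // Push the event only when the UUID was inserted for the first time.
        if (table.checkAndPut(row, Bytes.toBytes("d"), Bytes.toBytes("seen"), null, put)) {
            ctx.write(uuid, events.iterator().next());
        }
    }
}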
>>>
>>>On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:
>>>
>>>>Given the size of the data (> 1B rows) and the frequency of the job run (once per hour), I don't think your most optimal solution is to look up HBase for every single event. You will benefit more by loading the HBase table directly in your MR job.
>>>>
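One hedged reading of "loading the HBase table directly in your MR job": scan the table of already-seen UUIDs as a mapper input instead of issuing per-event lookups, then join against the new events on UUID in the shuffle. A minimal sketch using TableMapReduceUtil; the table name and SeenUuidMapper are illustrative, and job is the Job being configured:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes("d"));  // only the dedupe family is needed
scan.setCaching(1000);               // fetch rows in batches during the full scan
TableMapReduceUtil.initTableMapperJob(
    "dedup",                 // illustrative table name
    scan,
    SeenUuidMapper.class,    // hypothetical mapper emitting (uuid, seen-marker)
    Text.class, Text.class,  // mapper output key/value classes
    job);                    // the org.apache.hadoop.mapreduce.Job being configured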
>>>>In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>>
>>>>Also, once you have done the dedupe, are you going to use the data again in some other way, i.e. online serving of traffic or some other analysis? Or is this just to compute some unique #'s?
>>>>
>>>>It will be more helpful if you describe your final use case for the computed data too. Given the amount of back and forth, we can take it off-list and summarize the conversation for the list.
>>>>
>>>>On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
>>>>
>>>>>We can't rely on the assumption that event dupes will not dupe outside an hour boundary. So, your take is that doing a lookup per event within the MR job is going to be bad?
>>>>>
>>>>>________________________________
>>>>>From: Viral Bajaria <[EMAIL PROTECTED]>
>>>>>To: Rahul Ravindran <[EMAIL PROTECTED]>
>>>>>Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>>>>Sent: Thursday, February 14, 2013 12:48 PM

Michael Segel  | (m) 312.755.9623
Segel and Associates