Re: Using HBase for Deduping
Well,
Maybe it's a lack of sleep, but this is what I found...
 checkAndPut

public boolean checkAndPut(byte[] row,
                           byte[] family,
                           byte[] qualifier,
                           byte[] value,
                           Put put)
                    throws IOException
Atomically checks if a row/family/qualifier value matches the expected value. If it does, it adds the put. If the passed value is null, the check is for the lack of the column (i.e. non-existence).

Specified by:
checkAndPut in interface HTableInterface
Parameters:
row - to check
family - column family to check
qualifier - column qualifier to check
value - the expected value
put - data to put if check succeeds
Returns:
true if the new put was executed, false otherwise
Throws:
IOException - e
Maybe I'm reading it wrong?
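
A minimal put-if-absent sketch against that 0.94-era client API (the "events" table and the "e"/"seen" column names are made up for illustration):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class DedupPut {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");  // hypothetical table name
        try {
            byte[] row = Bytes.toBytes("some-event-uuid");  // event UUID as row key
            byte[] family = Bytes.toBytes("e");             // hypothetical family
            byte[] qualifier = Bytes.toBytes("seen");       // hypothetical qualifier

            Put put = new Put(row);
            put.add(family, qualifier, Bytes.toBytes(System.currentTimeMillis()));

            // Expected value null = "only put if the column does not exist yet",
            // which holds even when the whole row is absent.
            boolean firstTime = table.checkAndPut(row, family, qualifier, null, put);
            System.out.println(firstTime ? "new event, push it" : "dupe, drop it");
        } finally {
            table.close();
        }
    }
}

Since the expected value is null, the check passes when the column has never been written, which also covers the case where the row itself does not exist yet.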

But hey! What do I know? It's Valentine's Day and I'm spending my evening answering questions sitting in my man cave instead of spending it with my wife.
It's no wonder I live in the perpetual dog house! :-P
On Feb 14, 2013, at 7:35 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:

> checkAndPut() does not work when the row does not exist, or am I missing something?
>
> Sent from my phone. Excuse the terseness.
>
> On Feb 14, 2013, at 5:33 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>
>> What constitutes a duplicate?
>>
>> An oversimplification is to do an HTable.checkAndPut() where you do the put if the column doesn't exist.
>> Then, if the row is inserted (a TRUE return value), you push the event.
>>
>> That will do what you want.
>>
>> At least at first blush.
>>
>>
>>
>> On Feb 14, 2013, at 3:24 PM, Viral Bajaria <[EMAIL PROTECTED]> wrote:
>>
>>> Given the size of the data (> 1B rows) and the frequency of the job run (once
>>> per hour), I don't think your optimal solution is to look up HBase for
>>> every single event. You will benefit more by loading the HBase table
>>> directly in your MR job.
>>>
>>> In 1B rows, what's the cardinality? Is it 100M UUIDs? 99% unique UUIDs?
>>>
>>> Also, once you have de-duped, are you going to use the data again in
>>> some other way, i.e. online serving of traffic or some other analysis? Or
>>> is this just to compute some unique #'s?
>>>
>>> It would be more helpful if you described the final use case of the computed
>>> data too. Given the amount of back and forth, we can also take this off-list
>>> and summarize the conversation for the list.
>>>
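
For the "loading the HBase table directly in your MR job" idea above, a minimal sketch using the HBase mapreduce helpers of that era; the "events" table name and the output path argument are assumptions:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SeenUuids {
    // Emits every stored row key (the event UUID) so the reduce side can
    // join it against the current hour's new events.
    static class SeenUuidMapper extends TableMapper<Text, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context)
                throws IOException, InterruptedException {
            String uuid = Bytes.toString(row.get(), row.getOffset(), row.getLength());
            context.write(new Text(uuid), NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "load-seen-uuids");
        job.setJarByClass(SeenUuids.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // fetch bigger batches per RPC
        scan.setCacheBlocks(false);  // don't churn the block cache from a full scan

        // Full-table scan of the (hypothetical) "events" table as map input.
        TableMapReduceUtil.initTableMapperJob(
            "events", scan, SeenUuidMapper.class,
            Text.class, NullWritable.class, job);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(args[0]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

A reduce side (not shown) would join these keys against the hour's new events, e.g. wired up with MultipleInputs, and keep only UUIDs it has not seen before.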
>>> On Thu, Feb 14, 2013 at 1:07 PM, Rahul Ravindran <[EMAIL PROTECTED]> wrote:
>>>
>>>> We can't rely on the assumption that event dupes will not recur outside an
>>>> hour boundary. So, your take is that doing a lookup per event within the
>>>> MR job is going to be bad?
>>>>
>>>>
>>>> ________________________________
>>>> From: Viral Bajaria <[EMAIL PROTECTED]>
>>>> To: Rahul Ravindran <[EMAIL PROTECTED]>
>>>> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>>>> Sent: Thursday, February 14, 2013 12:48 PM
>>>> Subject: Re: Using HBase for Deduping
>>>>
>>>> You could go with a 2-pronged approach here, i.e. some MR and some HBase
>>>> lookups. I don't think this is the best solution either, given the # of
>>>> events you will get.
>>>>
>>>> FWIW, the solution below again relies on the assumption that if an event is
>>>> duped in the same hour it won't have a dupe outside of that hour boundary.
>>>> If it can, then you are better off running an MR job with the
>>>> current hour + another 3 hours of data, or an MR job with the current hour +
>>>> the HBase table as input to the job (i.e. no HBase lookups, just read
>>>> the HFiles directly).
>>>>
>>>> - Run a MR job which de-dupes events for the current hour i.e. only runs on
>>>> 1 hour worth of data.
>>>> - Mark records which you were not able to de-dupe in the current run
>>>> - For the records that you were not able to de-dupe, check against HBase
>>>> whether you saw that event in the past. If you did, you can drop the event.
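
A sketch of that last lookup step with the same-era client; each unresolved record costs one Get against the (hypothetical) "events" table, which is exactly the per-event lookup cost being debated above:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.util.Bytes;

public class PastEventCheck {
    // Returns true if this UUID was already recorded in an earlier hour,
    // in which case the event can be dropped as a duplicate.
    static boolean seenBefore(HTable table, String uuid) throws IOException {
        return table.exists(new Get(Bytes.toBytes(uuid)));
    }

    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "events");  // hypothetical table name
        try {
            System.out.println(seenBefore(table, "some-event-uuid")
                ? "dupe from a past hour, drop it"
                : "unseen, keep it and record the UUID");
        } finally {
            table.close();
        }
    }
}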

Michael Segel  | (m) 312.755.9623

Segel and Associates