-
Eliminating duplicate values
Matt Corgan 2013-03-03, 03:42
I have a few use cases where I'd like to leverage HBase's high write throughput to blindly write lots of data even if most of it hasn't changed since the last write. I want to retain MAX_VERSIONS=Integer.MAX_VALUE; however, I don't want to keep all the duplicate copies around forever. At compaction time, I'd like the compactor to compare the values of cells with the same row/family/qualifier and keep only the *oldest* version of duplicates. By keeping the oldest versions I can still get a snapshot of a row at any historical time.
Lars, I think you said Salesforce retains many versions of cells - do you retain all the duplicates?
I'm guessing co-processors would be the solution and am looking for some pointers on the cleanest way to implement it or some code if anyone has already solved the problem.
I'm also wondering if people think it's a generic enough use case that HBase could support it natively, say, with a column family attribute DISCARD_NEWEST_DUPLICATE=true/false. The cost would be higher CPU usage at compaction time because of all the value comparisons.
Thanks for any tips, Matt
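For illustration, the compaction-time rule described above could be sketched in plain Java as follows. This is only a sketch under stated assumptions, not HBase code: `CellVersion` and `DuplicateCollapser` are hypothetical names, and the input is assumed to hold every version of a single row/family/qualifier, sorted newest first. A version is dropped only when the next-older version carries the same value, so a value sequence such as A, B, A keeps all three entries and historical snapshots stay correct. Note that the rule is only safe when the pass sees all versions of the cell; if some versions live in files that are not part of the compaction, a "duplicate" might actually be separated from its twin by a different value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** One version of a cell: a timestamp plus the stored value. Hypothetical type, not an HBase class. */
final class CellVersion {
    final long timestamp;
    final byte[] value;

    CellVersion(long timestamp, byte[] value) {
        this.timestamp = timestamp;
        this.value = value;
    }
}

/** Hypothetical helper illustrating the "keep only the oldest duplicate" rule. */
final class DuplicateCollapser {
    /**
     * Input: all versions of one row/family/qualifier, sorted newest first.
     * Output: the same order, keeping only the oldest version of each run
     * of consecutive equal values.
     */
    static List<CellVersion> keepOldestOfEachRun(List<CellVersion> newestFirst) {
        List<CellVersion> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            CellVersion current = newestFirst.get(i);
            boolean hasOlder = i + 1 < newestFirst.size();
            // Drop this version if the next-older one has the same value:
            // the older version already records when that value first appeared.
            if (hasOlder && Arrays.equals(current.value, newestFirst.get(i + 1).value)) {
                continue;
            }
            kept.add(current);
        }
        return kept;
    }
}
```

The cost mentioned above shows up here as one `Arrays.equals` call per version, which is the extra CPU work a compaction would take on.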
-
Re: Eliminating duplicate values
Michel Segel 2013-03-03, 05:09
There are no duplicates. Cells have versions, which are timestamped. You could set the number of versions to one... But I'd recommend sticking with the default of 3.
Mike Segel
-
Re: Eliminating duplicate values
sriraam h 2013-03-03, 06:13
Where I work, we deal with loads of telecom data. We didn't want to deal with duplicates after loading, so we handled versions ourselves. The combination of rowkey, column family, the generated column name (the column name is different for different entries), and a timestamp we generated from the values made each record unique. (We had a one-to-one mapping with a value; others could perhaps use some sort of function mapping.) So reloading the data simply overwrites the original data.
I am not sure if this approach is possible and/or desirable for other types of data/ data models.
thanks, Sriraam
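A minimal sketch of this idea in plain Java, assuming the timestamp can be taken from a field of the record itself (here a hypothetical `eventTimeMillis`) rather than from the load time. `IdempotentLoader` is an in-memory stand-in for a table and does not use the HBase client API; it only shows that a reload lands on the same row/family/qualifier/timestamp coordinate and therefore behaves as an overwrite instead of creating a new version.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a table; not an HBase class. */
final class IdempotentLoader {
    // One entry per row/family/qualifier/timestamp coordinate.
    private final Map<String, String> cells = new TreeMap<>();

    /**
     * Stores a value at a coordinate whose timestamp comes from the record
     * itself (assumed field: eventTimeMillis), never from the load time.
     */
    void load(String row, String family, String qualifier,
              long eventTimeMillis, String value) {
        String coordinate = row + "/" + family + ":" + qualifier + "@" + eventTimeMillis;
        // Same coordinate on a reload: the entry is overwritten, not duplicated.
        cells.put(coordinate, value);
    }

    int cellCount() {
        return cells.size();
    }
}
```

Loading the same record twice leaves one cell; only a record with a different derived timestamp adds a second one. Whether a suitable one-to-one mapping from values to timestamps exists depends on the data model, as noted above.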
-
Re: Eliminating duplicate values
Anoop John 2013-03-03, 15:37
Matt, I remember someone else also sent mail a few days back looking for the same use case. Yes, a CP (coprocessor) can help. Maybe do the deletion of duplicates at major compaction time?
-Anoop-
-
Re: Eliminating duplicate values
Tom Brown 2013-03-03, 15:48
If you're doing comparisons to remove duplicates, I'm not sure you'd get any benefit from doing the de-duplication at compaction time.
If you de-duplicate at write time, the same number of comparisons would have to be made. There would be fewer disk writes (no duplicate data is written) but probably more random reads (though those could benefit from caching, depending on your dataset), and the size of the data to compact would be smaller.
Just my $0.02...
--Tom
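The write-time alternative could look roughly like the plain-Java sketch below. It is an illustration under assumptions, not HBase client code: `WriteTimeDeduper` is a hypothetical class, and its map lookup stands in for the read of the latest stored value (the random read mentioned above). The sketch also assumes writes arrive in timestamp order and that there is a single writer per column, since the read-then-write sequence here is not atomic.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical write-path filter that skips values equal to the latest stored one. */
final class WriteTimeDeduper {
    // Stand-in for the store: latest value per row/family/qualifier.
    private final Map<String, byte[]> latestByColumn = new HashMap<>();
    private int writesIssued = 0;

    /** Returns true if the value was written, false if it was skipped as a duplicate. */
    boolean putIfChanged(String column, byte[] value) {
        // One read per incoming write: the price of de-duplicating up front.
        byte[] latest = latestByColumn.get(column);
        if (latest != null && Arrays.equals(latest, value)) {
            return false; // unchanged since the last write, nothing hits the store
        }
        latestByColumn.put(column, value.clone());
        writesIssued++;
        return true;
    }

    int writesIssued() {
        return writesIssued;
    }
}
```

Because the first write of each distinct value is the one that gets stored, this keeps the oldest version of each run of duplicates, which matches the snapshot requirement from the original question.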