There are no duplicates.
Cells have versions, which are timestamped. You could set the number of versions to one, but I'd recommend sticking with the default of 3.
On Mar 2, 2013, at 9:42 PM, Matt Corgan <[EMAIL PROTECTED]> wrote:
> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write. I want to retain MAX_VERSIONS=Integer.MAX_VALUE,
> however, I don't want to keep all the duplicate copies around forever. At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates. By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false. The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
> Thanks for any tips,
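
For reference, the retention rule described in the quoted message can be sketched outside of HBase. This is only an illustration in plain Java, not HBase code and not a coprocessor: the class OldestDuplicateFilter, the Version type, and keepOldestOfEachRun are hypothetical names. It assumes the versions of one row/family/qualifier are passed in oldest first, and it reads "duplicates" as consecutive versions carrying identical value bytes, so that a value which changes and later changes back is still recorded and a read as of any past timestamp returns the same bytes as before.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: none of these names exist in HBase.
public class OldestDuplicateFilter {

    /** One version of a single row/family/qualifier: a timestamp plus the stored bytes. */
    public static final class Version {
        public final long timestamp;
        public final byte[] value;

        public Version(long timestamp, byte[] value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Keeps the oldest version of every run of consecutive identical values.
     * Assumption: the input is sorted by timestamp, oldest first.
     */
    public static List<Version> keepOldestOfEachRun(List<Version> oldestFirst) {
        List<Version> kept = new ArrayList<>();
        Version lastKept = null;
        for (Version v : oldestFirst) {
            // A version survives only if its value differs from the last survivor,
            // i.e. it is the first (oldest) write of a new value.
            if (lastKept == null || !Arrays.equals(lastKept.value, v.value)) {
                kept.add(v);
                lastKept = v;
            }
        }
        return kept;
    }
}
```

With writes of "a" at t=1, "a" at t=2, "b" at t=3 and "a" at t=4, the sketch keeps t=1, t=3 and t=4. The cost is one byte-array comparison per version, which is the extra compaction CPU mentioned in the quoted message. Note that HBase orders the versions of a cell newest first, so a real compaction hook would see them in the opposite order from the one assumed here.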