HBase, mail # user - Eliminating duplicate values


Re: Eliminating duplicate values
Anoop John 2013-03-03, 15:37
Matt Corgan,
     I remember someone else also sent a mail some days back looking for the
same use case.
Yes, a CP (coprocessor) can help. Maybe do the deletion of duplicates at major
compaction time?

-Anoop-
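
A minimal sketch of the pruning step Anoop suggests, written against the
0.94-era KeyValue API that was current at the time of this thread. The class
and method names are invented for illustration, and the scanner wrapper that
would feed it row batches from the scanner handed to
RegionObserver.preCompact(...) is omitted:

import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class DuplicateVersionPruner {

  /**
   * rowCells holds one row's cells in compaction-scan order, i.e. the
   * versions of each column sorted newest-first.  Drop every cell whose
   * next-older version in the same row/family/qualifier carries an identical
   * value, so only the oldest cell of each run of duplicates survives.
   */
  public static void dropNewerDuplicates(List<KeyValue> rowCells) {
    for (int i = 0; i + 1 < rowCells.size(); ) {
      KeyValue newer = rowCells.get(i);
      KeyValue older = rowCells.get(i + 1);
      boolean sameColumn = Bytes.equals(newer.getRow(), older.getRow())
          && Bytes.equals(newer.getFamily(), older.getFamily())
          && Bytes.equals(newer.getQualifier(), older.getQualifier());
      if (sameColumn && Bytes.equals(newer.getValue(), older.getValue())) {
        rowCells.remove(i);   // newer copy adds no information; keep the older one
      } else {
        i++;                  // value changed or new column: keep this cell
      }
    }
  }
}

Only consecutive duplicates collapse here: if a value changes and later changes
back, the cell at each boundary survives, which is what preserves the
row-at-any-historical-time property Matt asks for below.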

On Sun, Mar 3, 2013 at 9:12 AM, Matt Corgan <[EMAIL PROTECTED]> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE;
> however, I don't want to keep all the duplicate copies around forever.  At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates.  By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
>
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
>
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
>
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
>
> Thanks for any tips,
> Matt
>
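
To make the quoted use case concrete, here is a small sketch of the client
side under the same 0.94-era API: a column family that never ages out
versions, and a Get that reconstructs a row as of a past timestamp. The table
name, family, row key, and timestamp are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedSnapshotExample {

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();

    // Keep every version; duplicates would be pruned by the compaction-time
    // coprocessor discussed above, not by a VERSIONS limit.
    HTableDescriptor desc = new HTableDescriptor("metrics");   // hypothetical table
    HColumnDescriptor cf = new HColumnDescriptor("d");         // hypothetical family
    cf.setMaxVersions(Integer.MAX_VALUE);
    desc.addFamily(cf);

    HBaseAdmin admin = new HBaseAdmin(conf);
    admin.createTable(desc);
    admin.close();

    // Reconstruct the row as it looked at some historical instant: ask for
    // the newest version at or before that timestamp, one per column.
    long asOf = 1362268800000L;                                // example epoch millis
    HTable table = new HTable(conf, "metrics");
    Get get = new Get(Bytes.toBytes("row-1"));
    get.setTimeRange(0L, asOf + 1);   // upper bound of the time range is exclusive
    get.setMaxVersions(1);
    Result snapshot = table.get(get);
    table.close();

    System.out.println("cells in as-of snapshot: " + snapshot.size());
  }
}

Keeping the oldest copy of each run of duplicates (rather than the newest) is
what keeps such as-of reads correct after pruning: the surviving cell's
timestamp is no later than that of any duplicate that was dropped, so a
time-range query that would have matched a dropped copy still matches the
survivor and returns the same value.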