Re: Eliminating duplicate values
Anoop Sam John
I remember someone else also sent a mail a few days back looking for the same use case.
Yes, a coprocessor (CP) can help. Maybe do the deletion of duplicates at major compaction time?

-Anoop-

On Sun, Mar 3, 2013 at 9:12 AM, Matt Corgan <[EMAIL PROTECTED]> wrote:

> I have a few use cases where I'd like to leverage HBase's high write
> throughput to blindly write lots of data even if most of it hasn't changed
> since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE,
> however, I don't want to keep all the duplicate copies around forever.  At
> compaction time, I'd like the compactor to compare the values of cells with
> the same row/family/qualifier and only keep the *oldest* version of
> duplicates.  By keeping the oldest versions I can get a snapshot of a row
> at any historical time.
>
> Lars, I think you said Salesforce retains many versions of cells - do you
> retain all the duplicates?
>
> I'm guessing co-processors would be the solution and am looking for some
> pointers on the cleanest way to implement it or some code if anyone has
> already solved the problem.
>
> I'm also wondering if people think it's a generic enough use case that
> HBase could support it natively, say, with a column family attribute
> DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
> compaction time because of all the value comparisons.
>
> Thanks for any tips,
> Matt
>
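
(For later readers of this thread: below is a rough, untested sketch of the value-comparison step Matt and Anoop are discussing. The class and method names are made up for illustration, it assumes all versions of one row/family/qualifier have already been collected newest-first as HBase sorts them, and it ignores delete markers and TTLs.)

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

/**
 * Sketch only: given every version of a single row/family/qualifier,
 * sorted newest-first (HBase's natural sort order), drop each version
 * whose value is identical to the next-older version. Only the oldest
 * copy of each run of identical values survives, so a read "as of" any
 * historical timestamp still resolves to the correct value.
 */
public class DuplicateVersionPruner {

  public static List<KeyValue> dedupeColumn(List<KeyValue> newestFirst) {
    List<KeyValue> kept = new ArrayList<KeyValue>(newestFirst.size());
    for (int i = 0; i < newestFirst.size(); i++) {
      KeyValue current = newestFirst.get(i);
      boolean sameAsOlder = i + 1 < newestFirst.size()
          && Bytes.equals(current.getValue(), newestFirst.get(i + 1).getValue());
      if (!sameAsOlder) {
        // Either the value changed here, or this is the oldest version: keep it.
        kept.add(current);
      }
      // Otherwise the next-older version carries the same value, so this
      // newer duplicate can be discarded without losing any snapshot.
    }
    return kept;
  }
}

To run this at major compaction time as Anoop suggests, a RegionObserver would wrap the InternalScanner handed to the compaction hook (preCompact in BaseRegionObserver) and apply the pruning to each column as the scan emits it. The exact hook signatures differ across HBase releases, so treat that part as a pointer to the coprocessor javadocs for your version rather than a recipe.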