HBase >> mail # user >> Eliminating duplicate values


Eliminating duplicate values
I have a few use cases where I'd like to leverage HBase's high write
throughput to blindly write lots of data even if most of it hasn't changed
since the last write.  I want to retain MAX_VERSIONS=Integer.MAX_VALUE;
however, I don't want to keep all the duplicate copies around forever.  At
compaction time, I'd like the compactor to compare the values of cells with
the same row/family/qualifier and only keep the *oldest* version of
duplicates.  By keeping the oldest versions I can get a snapshot of a row
at any historical time.
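
To make the rule concrete, here is a rough sketch of the comparison I have
in mind, written as plain Java with no HBase classes. The names Version,
dropNewerDuplicates and valueAt are made up for illustration, and the
sketch assumes the versions of one row/family/qualifier arrive newest
first. A version is dropped only when the next-older version holds the
same bytes, so a value that changes and later changes back is still kept.
How this would be wired into a compaction is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class VersionDedup {

    /** Hypothetical stand-in for one stored version of a single cell. */
    static final class Version {
        final long timestamp;
        final byte[] value;

        Version(long timestamp, byte[] value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Input: all versions of one row/family/qualifier, newest first.
     * Output: the same order, minus every version whose value equals the
     * value of the next-older version. The oldest copy of each run of
     * identical values survives.
     */
    static List<Version> dropNewerDuplicates(List<Version> newestFirst) {
        List<Version> kept = new ArrayList<>();
        for (int i = 0; i < newestFirst.size(); i++) {
            Version current = newestFirst.get(i);
            boolean hasOlder = i + 1 < newestFirst.size();
            // One-cell lookahead: compare against the next-older version.
            if (hasOlder && Arrays.equals(current.value, newestFirst.get(i + 1).value)) {
                continue; // a newer duplicate, safe to discard
            }
            kept.add(current);
        }
        return kept;
    }

    /** Value visible at time asOf: the newest version with timestamp <= asOf. */
    static byte[] valueAt(List<Version> newestFirst, long asOf) {
        for (Version v : newestFirst) {
            if (v.timestamp <= asOf) {
                return v.value;
            }
        }
        return null; // nothing had been written yet at that time
    }

    public static void main(String[] args) {
        byte[] a = {'a'};
        byte[] b = {'b'};
        List<Version> versions = Arrays.asList(
                new Version(50, a), new Version(40, a), new Version(30, b),
                new Version(20, a), new Version(10, a));

        List<Version> kept = dropNewerDuplicates(versions);
        List<Long> keptTimestamps = new ArrayList<>();
        for (Version v : kept) {
            keptTimestamps.add(v.timestamp);
        }
        System.out.println(keptTimestamps); // prints [40, 30, 10]

        // A historical read returns the same bytes before and after the cleanup.
        System.out.println(Arrays.equals(valueAt(versions, 25), valueAt(kept, 25))); // prints true
    }
}
```

In the example the writes at 50 and 20 are discarded, yet a read as of any
timestamp still returns the same value as before.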

Lars, I think you said Salesforce retains many versions of cells - do you
retain all the duplicates?

I'm guessing coprocessors would be the solution and am looking for some
pointers on the cleanest way to implement it or some code if anyone has
already solved the problem.

I'm also wondering if people think it's a generic enough use case that
HBase could support it natively, say, with a column family attribute
DISCARD_NEWEST_DUPLICATE=true/false.  The cost would be higher CPU usage at
compaction time because of all the value comparisons.

Thanks for any tips,
Matt