HBase >> mail # user >> Optimal setup for regular purging of old rows

Optimal setup for regular purging of old rows

For some reason there are suddenly lots of questions about purging old data.  
I'm looking at the same thing and was wondering:

* In my case, the same table is shared by multiple users, each of whom may have
a different data retention policy.  Thus, I think I need to look at each and
every row and check whether it's considered "expired" and ready for deletion.  
Ideally, I'd associate a TTL when I Put a row and HBase would automagically
remove it when its time is up, but I don't think TTLs per row are doable, and
neither is automagical expiration, right?
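Lacking per-row TTLs, the workaround I'm imagining is to write an explicit expire-at timestamp into each row at Put time and filter on it later.  A minimal sketch of that bookkeeping (the column/family names and helper names are made up for illustration; the actual HBase Put call is shown only in a comment, since method names vary across client versions):

```java
// Sketch: store an explicit "expire-at" timestamp per row, since per-row
// TTLs aren't available. All names here are illustrative, not HBase API.
public class RowExpiry {
    // Absolute expiration time for a row, given the retention policy
    // (TTL) of the user who owns it.
    public static long expireAt(long putTimeMillis, long ttlMillis) {
        return putTimeMillis + ttlMillis;
    }

    // Decide at scan time whether a row is past its expiration.
    public static boolean isExpired(long expireAtMillis, long nowMillis) {
        return nowMillis >= expireAtMillis;
    }

    // With the HBase client, the value would be written alongside the
    // data at Put time, e.g. (hypothetical family/qualifier):
    //   put into family "meta", qualifier "expireAt",
    //   value Bytes.toBytes(expireAt(now, userTtl))
}
```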

* Is the only option to have a column with the expiration timestamp, and have a
nightly MR job that does a full table scan and purges all expired rows?  
Wouldn't that be *super* costly, because *all* the data would have to be read from
disk just for this one thing?  And this would evict all the good stuff from the OS
cache (and maybe the block cache and memstore?).  Is there a better way?
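For what it's worth, the core of the purge pass is simple once that expiration column exists.  A sketch of just the decision logic, with the real Scan/Delete plumbing left to comments (in the actual MR job, each map task would scan its split, read the expire-at column, and batch up Deletes):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class PurgeSketch {
    // Given a batch of scanned rows mapped to their stored expire-at
    // timestamps, return the row keys that should be deleted now.
    // In the real job: Scan with the expire-at column selected, then
    // issue a Delete per matching row key (batched for throughput).
    public static List<String> expiredKeys(Map<String, Long> rowToExpireAt,
                                           long nowMillis) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Long> e : rowToExpireAt.entrySet()) {
            if (nowMillis >= e.getValue()) {
                toDelete.add(e.getKey());
            }
        }
        return toDelete;
    }
}
```

This doesn't address the cache-pollution worry, of course; the scan still touches every row.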

* Are there specific recommendations for how to define tables so that batches of
rows can be removed efficiently on a regular basis?
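One design I keep coming back to (assuming time-ordered keys are acceptable for the workload; the key format below is made up) is baking a coarse time bucket into the row key, so that rows expiring together sort together and a purge becomes a contiguous range delete, or a whole-table drop if each bucket gets its own table:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

public class BucketedKeys {
    // Illustrative row key layout: "<yyyyMMdd>|<user>|<id>".
    // Rows written on the same day sort together, so purging a day's
    // worth of data is a bounded range scan + delete (or a table drop
    // if each bucket lives in its own table), not a full-table scan.
    public static String rowKey(long writeTimeMillis, String user, String id) {
        SimpleDateFormat day = new SimpleDateFormat("yyyyMMdd");
        day.setTimeZone(TimeZone.getTimeZone("UTC"));
        return day.format(new Date(writeTimeMillis)) + "|" + user + "|" + id;
    }

    // Start/stop keys covering one whole day bucket.
    public static String[] dayRange(String yyyyMMdd) {
        return new String[] { yyyyMMdd + "|", yyyyMMdd + "|\uffff" };
    }
}
```

The usual caveat: a time-leading key concentrates all current writes on one region, so salting the prefix or using per-bucket tables may be needed to avoid hotspotting.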

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/