HBase >> mail # user >> Optimal setup for regular purging of old rows

Re: Optimal setup for regular purging of old rows

> > * In my case, the same table is shared by multiple users, each of which may
> > have a different data retention policy.  Thus, I think I need to look at each
> > and every row and check if it's considered "expired" and thus ready for
> > purging.  Ideally, I'd associate a TTL when I Put a row and HBase would
> > automagically remove it when its time is up, but I don't think TTLs per row
> > are doable, and neither is automagical expiration, right?
> >
> TTLs are per column family though the TTL you talk of above seems
> different than the CF TTL.  You want a row-based TTL?

Right.  Imagine I have 3 users, each with a different TTL for their data.
Then I think I need something like this (key, data, expiration date):

user1_key1    data   2011-04-01
user2_key1    data   2022-01-01
user3_key1    data   -1 (say that -1 means "never expire - keep")
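A per-row check along these lines is straightforward; here is a minimal sketch in plain Java, assuming the expiration date is stored as epoch milliseconds in its own column and using -1 as the "never expire" sentinel from the example above (class and method names are made up):

```java
// Sketch of the per-row expiration check described above.
// Assumes the expiration date is stored as epoch millis alongside the row;
// -1 means "never expire - keep", as in the example rows.
public class RowTtl {
    static final long NEVER_EXPIRE = -1L;

    // Returns true if the row's stored expiration time has passed.
    static boolean isExpired(long expiresAtMillis, long nowMillis) {
        if (expiresAtMillis == NEVER_EXPIRE) {
            return false;
        }
        return expiresAtMillis <= nowMillis;
    }
}
```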

> > * Is the only option to have a column with the expiration timestamp, and
> > have a nightly MR job that does a full table scan and purges all expired
> > rows?
> I don't know any other way.
> > Wouldn't that be *super* costly because *all* data would have to be read
> > from disk just for this one thing?
> Yeah, it'd be costly unless you added this 'meta' info into a separate CF.

Right, separate CF.

> > And this would evict all good stuff from the OS
> > cache (and maybe block cache and memstore?)  Is there a better way?
> >
> Not from blockcache or memstore.  Scans usually by-pass blockcache
> IIRC (if I have it wrong here, I know there is a flag to set on Scans
> to say whether to go via blockcache or not).

OK, good :)
And I suppose the OS cache dirtying is minimized by having a CF with *just*
this date/timestamp column in it?
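The nightly purge then boils down to scanning just that expiration column and collecting the keys of expired rows for deletion. The selection logic can be sketched independently of the HBase client API like this (names are hypothetical; in a real MR job this loop would be the map phase over a Scan restricted to the expiration column in the separate CF):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of the nightly purge's selection step: given each row key and the
// expiration timestamp read from the dedicated 'meta' CF, collect the keys
// whose time is up. -1 means "never expire".
public class PurgeSelector {
    static final long NEVER_EXPIRE = -1L;

    static List<String> expiredKeys(Map<String, Long> expirations, long nowMillis) {
        List<String> toDelete = new ArrayList<>();
        for (Map.Entry<String, Long> e : expirations.entrySet()) {
            long expiresAt = e.getValue();
            if (expiresAt != NEVER_EXPIRE && expiresAt <= nowMillis) {
                toDelete.add(e.getKey());
            }
        }
        return toDelete;
    }
}
```

The collected keys would then be fed to batched Deletes; keeping the scan confined to the small expiration CF is what avoids reading *all* data off disk.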

> > * Are there specific recommendations for how to define tables to be able
> > to efficiently remove batches of rows on a regular basis?
> >
> We're used to TTLs or max versions settings on the column family
> schema.  You want something more exotic, Otis.
> Why remove the data at all?  Or why not just let hbase do its TTL
> cleanup.  Is it space you are worried about?
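For reference, the CF-level TTL mentioned here is declared on the column family itself; in the HBase shell it looks roughly like this (table and family names are made up):

```
alter 'mytable', {NAME => 'data', TTL => 604800}   # TTL in seconds (here: 7 days)
```

Cells older than the TTL are then dropped by HBase during major compactions, with no external purge job.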

Yeah, it's about the space and the cost associated with it.
I'd love to let HBase do its TTL cleanup, but it looks like a single TTL applies
to all rows in a given CF, which means I'd have to have my different users use
different CFs instead of keeping their data in the same CF.  Is that what you