Re: Optimal setup for regular purging of old rows
Otis Gospodnetic 2011-03-10, 04:27
> > * In my case, the same table is shared by multiple users, each of which may have
> > a different data retention policy. Thus, I think I need to look at each and
> > every row and check if it's considered "expired" and thus ready for removal.
> > Ideally, I'd associate a TTL when I Put a row and HBase would automagically
> > remove it when its time is up, but I don't think TTLs per row are doable,
> > neither is automagical expiration, right?
> TTLs are per column family though the TTL you talk of above seems
> different than the CF TTL. You want a row-based TTL?
Right. Imagine I have 3 users that each have different TTL for their data.
Then I think I need something like this (key, data, expiration date):
user1_key1 data 2011-04-01
user2_key1 data 2022-01-01
user3_key1 data -1 (say that -1 means "never expire - keep")
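
To make the per-row check concrete, here is a minimal sketch in plain Python (not HBase client code). The function name `is_expired`, the ISO date-string encoding, and the choice to treat a row as expired on its expiration date itself are all assumptions for illustration; only the -1 "never expire" convention comes from the example above.

```python
from datetime import date

NEVER_EXPIRE = "-1"  # sentinel from the example above: keep the row forever


def is_expired(expiration: str, today: date) -> bool:
    """Return True if a row with this expiration value is ready to be purged."""
    if expiration == NEVER_EXPIRE:
        return False
    # Assumed encoding: ISO date string such as "2011-04-01".
    # Assumption: a row counts as expired on the expiration date itself.
    return date.fromisoformat(expiration) <= today


# The three example rows, reduced to (key -> expiration date).
rows = {
    "user1_key1": "2011-04-01",
    "user2_key1": "2022-01-01",
    "user3_key1": "-1",
}

today = date(2011, 6, 1)
expired = sorted(key for key, exp in rows.items() if is_expired(exp, today))
print(expired)
```

With `today` set to 2011-06-01, only the user1 row qualifies; the user3 row never does, whatever the date.
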
> > * Is the only option to have a column with the expiration timestamp, and a
> > nightly MR job that does a full table scan and purges all expired rows?
> I don't know any other way.
> > Wouldn't that be *super* costly because *all* data would have to be read from
> > disk just for this one thing?
> Yeah, it'd be costly unless you added this 'meta' info into a separate CF.
Right, separate CF.
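
Roughly what I have in mind for the nightly job, as a toy in-memory sketch in plain Python rather than a real MR job or HBase client code. The table is modeled as a nested dict (row key -> column family -> qualifier -> value); the family names `d` and `meta`, the qualifier `expires`, and the function `purge_expired` are all hypothetical. The point is only that the purge pass consults the small expiration-only family and never reads the bulky data family.

```python
from datetime import date

META_CF = "meta"         # hypothetical name for the small expiration-only family
EXPIRES_COL = "expires"  # hypothetical column qualifier holding the date


def purge_expired(table: dict, today: date) -> list:
    """Delete every row whose meta:expires date has passed; return purged keys."""
    purged = []
    # Snapshot the keys so rows can be deleted while iterating.
    for row_key in list(table):
        # Only the meta family is consulted; the data family is never read.
        expires = table[row_key][META_CF][EXPIRES_COL]
        if expires == "-1":  # -1 means "never expire - keep"
            continue
        # Assumed encoding: ISO date string; expired on or after that date.
        if date.fromisoformat(expires) <= today:
            del table[row_key]  # drops the whole row, data family included
            purged.append(row_key)
    return purged


table = {
    "user1_key1": {"d": {"payload": "data"}, "meta": {"expires": "2011-04-01"}},
    "user2_key1": {"d": {"payload": "data"}, "meta": {"expires": "2022-01-01"}},
    "user3_key1": {"d": {"payload": "data"}, "meta": {"expires": "-1"}},
}

purged = purge_expired(table, date(2011, 6, 1))
print(purged, sorted(table))
```

In a real job the loop would be a scan restricted to the meta family, followed by a delete per expired row key.
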
> > And this would evict all good stuff from the OS
> > cache (and maybe block cache and memstore?) Is there a better way?
> Not from blockcache or memstore. Scans usually by-pass blockcache
> IIRC (If I have it wrong here, I know there is a flag to set on Scans
> to say whether to go via blockcache or not).
OK, good :)
And I suppose the OS dirtying is minimized by having a CF with *just* this
date/timestamp Column in it?
> > * Are there specific recommendations for how to define tables to be able to
> > efficiently remove batches of rows on a regular basis?
> We're used to TTLs or max versions setting on the column family
> schema. You want something more exotic Otis.
> Why remove the data at all? Or why not just let hbase do its TTL
> cleanup. Is it space you are worried about?
Yeah, it's about the space and cost associated with it.
I'd love to let HBase do its TTL, but it looks like a single TTL is for all rows
in a given CF, which means that I'd have to have my different users use
different CFs instead of having their data in the same CF. Is that what you