HBase >> mail # user >> Optimal setup for regular purging of old rows


Re: Optimal setup for regular purging of old rows
Hi,

> > * In my case, the same table is shared by multiple users, each of which
> > may have a different data retention policy.  Thus, I think I need to
> > look at each and every row and check if it's considered "expired" and
> > thus ready for deletion.  Ideally, I'd associate a TTL when I Put a row
> > and HBase would automagically remove it when its time is up, but I don't
> > think TTLs per row are doable, and neither is automagical expiration,
> > right?
> >
>
> TTLs are per column family, though the TTL you talk of above seems
> different than the CF TTL.  You want a row-based TTL?

Right.  Imagine I have 3 users that each have different TTL for their data.
Then I think I need something like this (key, data, expiration date):

user1_key1    data   2011-04-01
user2_key1    data   2022-01-01
user3_key1    data   -1 (say that -1 means "never expire - keep")
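In other words, something like this sketch in Python (keys and dates are just the made-up examples above; `None` stands in for the -1 "never expire" marker) to show the per-row check I have in mind:

```python
from datetime import date

# Hypothetical per-row records: (row key, payload, expiration date or None).
# None plays the role of -1 above: "never expire - keep".
rows = [
    ("user1_key1", "data", date(2011, 4, 1)),
    ("user2_key1", "data", date(2022, 1, 1)),
    ("user3_key1", "data", None),
]

def is_expired(expiration, today):
    """A row is expired once its per-row expiration date has passed."""
    return expiration is not None and expiration < today

today = date(2021, 6, 1)
expired_keys = [key for key, _, exp in rows if is_expired(exp, today)]
print(expired_keys)  # only user1's row has expired by mid-2021
```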

> > * Is the only option to have a column with the expiration timestamp, and
> > have a nightly MR job that does a full table scan and purges all expired
> > rows?
>
> I don't know any other way.
>
> > Wouldn't that be *super* costly because *all* data would have to be read
> > from disk just for this one thing?
>
> Yeah, it'd be costly unless you added this 'meta' info into a separate CF.

Right, separate CF.
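So the nightly job would scan just that small separate CF and delete the expired keys — roughly like this (a plain-Python stand-in for the scan-and-delete, not real HBase client code; the `meta:expires` column name is invented for illustration):

```python
# Stand-in for the table: row key -> {"meta:expires": unix ts or None, ...}.
# A real job would Scan only the small 'meta' CF and issue Deletes.
table = {
    "user1_key1": {"meta:expires": 1301616000, "d:payload": "old data"},
    "user2_key1": {"meta:expires": 1640995200, "d:payload": "newer data"},
    "user3_key1": {"meta:expires": None,       "d:payload": "keep forever"},
}

def purge_expired(table, now):
    """Delete rows whose 'meta:expires' timestamp is in the past."""
    doomed = [key for key, cols in table.items()
              if cols["meta:expires"] is not None and cols["meta:expires"] < now]
    for key in doomed:
        del table[key]  # stand-in for a Delete against the real table
    return doomed

deleted = purge_expired(table, now=1400000000)  # 'now' in May 2014
print(deleted)  # only the row that expired in 2011 is purged
```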

> > And this would evict all good stuff from the OS
> > cache (and maybe block cache and memstore?)  Is there a better way?
> >
>
> Not from blockcache or memstore.  Scans usually by-pass blockcache
> IIRC (if I have it wrong here, I know there is a flag to set on Scans
> to say whether to go via blockcache or not).

OK, good :)
And I suppose the OS cache churn is minimized by having a CF with *just* this
date/timestamp column in it?

> > * Are there specific recommendations for how to define tables to be able
> > to efficiently remove batches of rows on a regular basis?
> >
>
> We're used to TTLs or max versions settings on the column family
> schema.  You want something more exotic, Otis.
>
> Why remove the data at all?  Or why not just let hbase do its TTL
> cleanup.  Is it space you are worried about?

Yeah, it's about the space and the cost associated with it.
I'd love to let HBase do its TTL cleanup, but it looks like a single TTL applies
to all rows in a given CF, which means I'd have to have my different users use
different CFs instead of keeping their data in the same CF.  Is that what you
mean?
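To make sure I follow: with one CF per user, the built-in CF-level TTL would cover it — here's a plain-Python mock of how that expiration would apply (CF names and TTL values are invented; `None` stands in for HBase's "forever" default):

```python
# CF-level TTLs, one CF per user, in seconds; None = never expire.
cf_ttl = {"u1": 86400 * 30, "u2": 86400 * 365, "u3": None}

# Cells as (cf, row key, write timestamp in unix seconds).
cells = [
    ("u1", "key1", 1000000),
    ("u2", "key1", 1000000),
    ("u3", "key1", 1000000),
]

def survives(cf, written_at, now):
    """Mimic a CF-level TTL: a cell is kept while now - written_at < TTL."""
    ttl = cf_ttl[cf]
    return ttl is None or (now - written_at) < ttl

now = 1000000 + 86400 * 100  # 100 days after the writes
kept = [(cf, key) for cf, key, ts in cells if survives(cf, ts, now)]
print(kept)  # u1's 30-day CF TTL has lapsed; u2 and u3 remain
```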

Thanks,
Otis