> Patrick raised an issue that might be of concern... region splits.
Right. And if I understand correctly, if I want to have multiple CFs that grow
unevenly, these region splits are something I have to then be willing to accept.
> But barring that... what makes the most sense on retention policies?
> The point is that it's a business issue that will be driving the logic.
The exact business requirement is not defined yet. Say 3-4 retention policies.
> Depending on a clarification from Patrick or JGray or JDCryans... you may want
>to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes
>the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)
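The sweeper suggestion above boils down to: compute a retention cutoff, scan for rows older than it, delete them, then major-compact. A minimal sketch of the cutoff/expiry logic is below; the HBase calls that would wrap it (a Scan with setTimeRange(0, cutoff), a Delete per expired row, then a major compaction) are described in comments, and the class and method names here are hypothetical, not from the thread.

```java
// Sketch of the nightly-sweeper bookkeeping. In the real job, an HBase Scan
// with setTimeRange(0, cutoff) would find candidate rows and HTable.delete()
// would remove them, followed by an after-hours major compaction.
public class RetentionSweeper {

    // Milliseconds in one day.
    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Oldest timestamp (exclusive upper bound for the Scan's time range)
    // that is still inside the retention window.
    static long cutoff(long nowMs, int retentionDays) {
        return nowMs - retentionDays * DAY_MS;
    }

    // A row written at writeTsMs has expired if it falls before the cutoff.
    static boolean isExpired(long writeTsMs, long nowMs, int retentionDays) {
        return writeTsMs < cutoff(nowMs, retentionDays);
    }
}
```

The point of isolating the cutoff arithmetic is that the same function serves both the sweeper's Scan bounds and any sanity checks before deleting.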
Right. The reason I am even thinking about CF-per-retention-policy is because I
am afraid of a big and expensive nightly scan-and-delete. That said, I don't
actually know how and if this scan will be expensive. So I'm trying to
understand pros and cons ahead of time. Maybe I'm prematurely optimizing, but
since this feels like a big structural/architectural change, I thought it would
be worth "getting it right" before I have lots of tenants and their data in the
system. Thank you everyone!
> > Date: Thu, 17 Mar 2011 10:38:11 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Suggested and max number of CFs per table
> > To: [EMAIL PROTECTED]
> > Hi,
> > > Patrick,
> > >
> > > Perhaps I misunderstood Otis' design.
> > >
> > > I thought he'd create the CF based on duration.
> > > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > > So that you set up the table once with all CFs.
> > > Then you'd write the data to one and only one of those buckets.
> > That's right.
> > > The only time you'd have a problem is if you have a tenant who switches
> > >retention policy.
> > >
> > > Although you could move data to the new CF so that you still only query
> > >one CF for data.
> > That's right. Say a tenant decides to switch from keeping his data for 1 month
> > to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from 1-month CF
> > If the tenant wants to go from 6 months to 1 month then we'd additionally have
> > to limit copying in step 2) above to just the last 1 month of data and drop the
> > rest.
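The trimmed copy in the shrinking case (6 months down to 1 month) amounts to filtering the old CF's rows by the new, shorter window before writing them into the destination CF. A minimal sketch under that assumption, with timestamps as epoch millis and illustrative names not taken from the thread:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of step 2) when retention shrinks: only rows still inside the new
// window get copied to the destination CF; older rows are simply not copied
// (step 3 then purges the old CF).
public class RetentionMigration {

    static final long DAY_MS = 24L * 60 * 60 * 1000;

    // From the old CF's row timestamps, keep only those newer than the
    // new (shorter) retention cutoff.
    static List<Long> rowsToCopy(List<Long> oldCfTimestamps,
                                 long nowMs, int newRetentionDays) {
        long cutoff = nowMs - newRetentionDays * DAY_MS;
        List<Long> keep = new ArrayList<>();
        for (long ts : oldCfTimestamps) {
            if (ts >= cutoff) {
                keep.add(ts);
            }
        }
        return keep;
    }
}
```

Growing the window (1 month to 6 months) is the degenerate case: every row in the old CF is inside the new window, so everything is copied.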
> > To answer Mike's questions from his other reply:
> > > What's the data access patterns? Are they discrete between tenants?
> > > As long as the data access is discrete between tenants and the tenants use
> > >only one bucket, you can do what you suggest.
> > Yes, data for a given tenant would be written to just 1 of those CFs.
> > > But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you
> > >attempt to get the data. This means you read from one column family when you
> > >do your get() and not all of them, right? ;-)
> > Yes, when reading the data I'd know the tenant's retention policy and based
> > that I'd know from which CF to get the data.
> > So my question here is: How many such CFs would it be wise to have? 2? 3?
> > > With respect to your discussion on region splits..
> > > So you're saying that if one CF splits then all of the CFs are affected and
> > >split as well?
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
> > compactions, not splits, but from what I understand flushes can trigger splits
> > because they increase the aggregate size of MapFiles, which at some point
> > triggers region splitting. Please correct me if I'm wrong. :)