Re: Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 21:44
Hi,
> Patrick raised an issue that might be of concern... region splits.

Right. And if I understand correctly, if I want to have multiple CFs that grow unevenly, these region splits are something I have to then be willing to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that its a business issue that will be driving the logic.

The exact business requirement is not defined yet. Say 3-4 retention policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may want to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right. The reason I am even thinking about CF-per-retention-policy is because I am afraid of a big and expensive nightly scan-and-delete. That said, I don't actually know how and if this scan will be expensive. So I'm trying to understand pros and cons ahead of time. Maybe I'm prematurely optimizing, but since this feels like a big structural/architectural change, I thought it would be worth "getting it right" before I have lots of tenants and their data in the system.

Thank you everyone!
Otis

> HTH
>
> -Mike
>
> > Date: Thu, 17 Mar 2011 10:38:11 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Suggested and max number of CFs per table
> > To: [EMAIL PROTECTED]
> >
> > Hi,
> >
> > > Patrick,
> > >
> > > Perhaps I misunderstood Otis' design.
> > >
> > > I thought he'd create the CF based on duration.
> > > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > > So that you set up the table once with all CFs.
> > > Then you'd write the data to one and only one of those buckets.
> >
> > That's right.
> >
> > > The only time you'd have a problem is if you have a tenant who switches their retention policy.
> > > Although you could move data still in a CF so that you still only query one CF for data.
> >
> > That's right. Say a tenant decides to switch from keeping his data for 1 month to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from 1-month CF
> >
> > If the tenant wants to go from 6-months to 1-month then we'd additionally want to limit copying in step 2) above to just the last 1 month of data and drop the rest.
> >
> > To answer Mike's questions from his other reply:
> >
> > > What's the data access patterns? Are they discrete between tenants?
> > > As long as the data access is discrete between tenants and the tenants write to only one bucket, you can do what you suggest.
> >
> > Yes, data for a given tenant would be written to just 1 of those CFs.
> >
> > > But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you attempt to get the data. This means you read from one column family when you do your get() and not all of them, right? ;-)
> >
> > Yes, when reading the data I'd know the tenant's retention policy and based on that I'd know from which CF to get the data.
> >
> > So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> >
> > > With respect to your discussion on region splits..
> > > So you're saying that if one CF splits then all of the CFs are affected and split as well?
> >
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and compactions, not splits, but from what I understand flushes can trigger splits because they increase the aggregate size of MapFiles, which at some point causes Region splitting. Please correct me if I'm wrong. :)
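
For anyone following the quoted steps 1) to 3) for a tenant who changes retention policy, here is a rough in-memory sketch in Python. It is not HBase client code: the table is modelled as nested dicts, and the CF names (cf_1m, cf_6m), the "tenant|..." row-key prefix, the retention windows in days, and the switch_policy helper are all assumptions made up for illustration. It only shows the order of operations, including the case where a shorter policy limits what gets copied.

```python
from dataclasses import dataclass

DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical column-family names and retention windows in days (assumed, not from the thread).
RETENTION_DAYS = {"cf_1m": 30, "cf_6m": 180}


@dataclass
class Cell:
    timestamp_ms: int
    value: bytes


def switch_policy(table, policy_by_tenant, tenant, new_cf, now_ms):
    """Move one tenant's data from its current CF to new_cf.

    table is modelled as table[row_key][column_family][qualifier] -> Cell,
    and row keys are assumed to start with "<tenant>|".
    Returns the number of cells copied into new_cf.
    """
    old_cf = policy_by_tenant[tenant]
    if old_cf == new_cf:
        return 0

    # Step 1: new writes (and reads) for this tenant now go to new_cf.
    policy_by_tenant[tenant] = new_cf

    # When moving to a shorter policy, only cells inside the new window are
    # kept. When moving to a longer policy, every cell still present in the
    # old CF is already newer than this cutoff.
    cutoff_ms = now_ms - RETENTION_DAYS[new_cf] * DAY_MS

    copied = 0
    prefix = tenant + "|"
    for row_key, families in table.items():
        if not row_key.startswith(prefix):
            continue  # another tenant's row, leave it alone
        # Step 3: remove this tenant's cells from the old CF ...
        old_cells = families.pop(old_cf, {})
        for qualifier, cell in old_cells.items():
            # Step 2: ... and copy the ones still inside the new window.
            if cell.timestamp_ms >= cutoff_ms:
                families.setdefault(new_cf, {})[qualifier] = cell
                copied += 1
    return copied
```

In a real cluster the copy and the purge would be separate scans and writes rather than one in-place pass, so the cost being discussed in this thread (scan plus delete plus later compaction) does not go away; it just applies to one tenant at a time.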