> Perhaps I misunderstood Otis' design.
> I thought he'd create the CF based on duration.
> So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> So that you set up the table once with all CFs.
> Then you'd write the data to one and only one of those buckets.
> The only time you'd have a problem is if you have a tenant who switches their
> retention policy. Although you could move data that's still in a CF so that
> you still only query one CF.
That's right. Say a tenant decides to switch from keeping his data for 1 month
to keeping it for 6 months.
Then we'd have to:
1) start writing new data for this tenant to the 6-month CF
2) copy this tenant's old data from 1-month CF to the 6-month CF
3) purge/delete old data for this tenant from 1-month CF
If the tenant wants to go from 6 months to 1 month, then we'd additionally want
to limit copying in step 2) above to just the last 1 month of data and drop the
rest.
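The steps above can be sketched as a plain in-memory model (no real HBase calls; `buckets`, `migrate_tenant`, and the CF names are illustrative assumptions, not anything from the actual system):

```python
# In-memory sketch of the retention-change steps above.
# buckets maps a CF name to {(tenant, timestamp): value} -- a stand-in
# for the rows in that column family, NOT a real HBase API.
from datetime import datetime, timedelta

def migrate_tenant(buckets, tenant, old_cf, new_cf, keep_last=None, now=None):
    """Copy a tenant's rows from old_cf to new_cf (step 2), then purge
    them from old_cf (step 3).  For the shrinking case (e.g. 6 months
    -> 1 month), pass keep_last=timedelta(days=30): only rows newer
    than now - keep_last are copied; the rest are simply dropped."""
    now = now or datetime.utcnow()
    cutoff = now - keep_last if keep_last else None
    old_rows = buckets[old_cf]
    # Materialize the key list so we can delete while iterating.
    for key in [k for k in old_rows if k[0] == tenant]:
        if cutoff is None or key[1] >= cutoff:
            buckets[new_cf][key] = old_rows[key]   # step 2: copy
        del old_rows[key]                          # step 3: purge
```

Step 1 (routing new writes to the new CF) would just be flipping the tenant's policy entry before running the copy, so fresh writes land in the new bucket while the backfill runs.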
To answer Mike's questions from his other reply:
> What's the data access patterns? Are they discrete between tenants?
> As long as the data access is discrete between tenants and the tenants write to
>only one bucket, you can do what you suggest.
Yes, data for a given tenant would be written to just 1 of those CFs.
> But here's something to consider...
> You are going to want to know your tenant's retention policy before you
>attempt to get the data. This means you read from one column family when you do
>your get() and not all of them, right? ;-)
Yes, when reading the data I'd know the tenant's retention policy and based on
that I'd know from which CF to get the data.
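A minimal sketch of that routing, assuming one CF per policy along the lines of the (daily, weekly, monthly, annual, indefinite) list above (the mapping and all names are hypothetical):

```python
# Hypothetical policy -> column-family routing: one CF per retention
# policy, so every read and write for a tenant touches exactly one CF.
RETENTION_CF = {
    "daily": "cf_daily",
    "weekly": "cf_weekly",
    "monthly": "cf_monthly",
    "annual": "cf_annual",
    "indefinite": "cf_indefinite",
}

def cf_for_tenant(policies, tenant):
    """Look up the tenant's retention policy and return the single CF
    to use for both get()s and puts."""
    return RETENTION_CF[policies[tenant]]
```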
So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> With respect to your discussion on region splits..
> So you're saying that if one CF splits then all of the CFs are affected and
>split as well?
http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
compactions, not splits, but from what I understand flushes can indirectly
trigger splits, because they increase the aggregate size of a region's store
files (HFiles), which at some point causes the region to split. Please correct
me if I'm wrong. :)
So this is also what I wanted to verify. As you can imagine, there will likely
be more tenants with a 1-month data retention policy than with a 1-year or
"forever" retention. So that 1-month CF will grow much more quickly, and if I
understand the above section in the HBase book correctly, it will cause all the
other CFs' files to split (even if they are not big enough yet), which means
more disk and network IO.
That is, if all those CFs are in the same table. If they are in different
tables then this would not happen?
> > Date: Thu, 17 Mar 2011 11:26:35 -0400
> > Subject: Re: Suggested and max number of CFs per table
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > CC: [EMAIL PROTECTED]
> > Otis,
> > Perhaps your biggest issue will be the need to disable the table to add a
> > new CF. So effectively you need to bring down the application to move in a
> > new tenant.
> > Another thing with multiple CFs is that if one CF tends to get
> > disproportionally more data, you will get a lot of region splitting, and
> > other CFs will have HFiles for a region that are very small.
> > I think the only reasonable use of CFs is if you really need row-level
> > atomicity across CFs. Otherwise just use multiple tables.
> > On Thu, Mar 17, 2011 at 2:30 AM, Otis Gospodnetic <
> > [EMAIL PROTECTED]> wrote:
> > > Hi,
> > >
> > > My Q is around the suggested or maximum number of CFs per table (see
> > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > >
> > > Consider the following use-case.
> > > * A multi-tenant system.
> > > * All tenants write data to the same table.
> > > * Tenants have different data retention policies.
> > >
> > > For the above use case I thought one could then just have different CFs