Re: Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 21:44
Hi,
> Patrick raised an issue that might be of concern... region splits.

Right.  And if I understand correctly, if I want to have multiple CFs that grow
unevenly, these region splits are something I then have to be willing to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that it's a business issue that will be driving the logic.

The exact business requirement is not defined yet.  Say 3-4 retention policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may want
> to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes
> the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right.  The reason I am even thinking about CF-per-retention-policy is that I
am afraid of a big and expensive nightly scan-and-delete.  That said, I don't
actually know whether, or how, that scan would be expensive.  So I'm trying to
understand the pros and cons ahead of time.  Maybe I'm prematurely optimizing, but
since this feels like a big structural/architectural change, I thought it would
be worth "getting it right" before I have lots of tenants and their data in the
system.
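
To make my worry concrete, here is the kind of nightly sweep I have in mind.  A
minimal, untested sketch; the table name "metrics", the CF name "d", and cell
timestamps tracking event time are all assumptions on my part:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class NightlySweeper {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");
    byte[] cf = Bytes.toBytes("d");
    // 30-day retention: cells older than this cutoff get swept.
    long cutoff = System.currentTimeMillis() - 30L * 24 * 60 * 60 * 1000;
    Scan scan = new Scan();
    scan.addFamily(cf);
    scan.setTimeRange(0, cutoff); // only return rows that still have old cells
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        Delete del = new Delete(r.getRow());
        del.deleteFamily(cf, cutoff); // delete cells with ts <= cutoff, keep newer ones
        table.delete(del);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}

Even if each individual Delete is cheap, it's the full-table scan plus the extra
write volume from the deletes themselves that I'm afraid of.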

Thank you everyone!

Otis
> HTH
>
> -Mike
>
>
> > Date: Thu, 17 Mar 2011 10:38:11 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Suggested and max number of CFs per table
> > To: [EMAIL PROTECTED]
> >
> > Hi,
> >
> >
> > > Patrick,
> > >
> > > Perhaps I misunderstood Otis' design.
> > >
> > > I thought he'd create the CF based on duration.
> > > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > > So that you set up the table once with all CFs.
> > > Then you'd write the data to one and only one of those buckets.
> >
> > That's right.
> >
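
If it helps to make the layout concrete, here is roughly the one-time setup I
have in mind (the table name "metrics" and the bucket names are placeholders,
not settled design):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;

public class CreateBucketedTable {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("metrics");
    // One CF per retention bucket; each tenant's data lives in exactly one of them.
    for (String bucket : new String[] {"daily", "weekly", "monthly", "annual", "indefinite"}) {
      desc.addFamily(new HColumnDescriptor(bucket));
    }
    admin.createTable(desc);
  }
}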
> > > The only time you'd have a problem is if you have a tenant who switches
> > > their retention policy.
> > >
> > > Although you could move data still in a CF so that you still only query
> > > one CF for data.
> >
> > That's right.  Say a tenant decides to switch from keeping his data for 1
> > month to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from the 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from the 1-month CF
> >
> > If the tenant wants to go from 6 months to 1 month then we'd additionally
> > want to limit copying in step 2) above to just the last 1 month of data and
> > drop the rest.  (Steps 2 and 3 are sketched below.)
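
Here is a rough, untested sketch of steps 2) and 3).  The CF names "1m" and
"6m", the tenant-id row-key prefix, and the PrefixFilter scan are all
assumptions on my part; step 1) is just a switch on the application side:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RetentionMigrator {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "metrics");
    byte[] from = Bytes.toBytes("1m");                // old retention bucket
    byte[] to = Bytes.toBytes("6m");                  // new retention bucket
    byte[] tenantPrefix = Bytes.toBytes("tenant42|"); // assumed row-key layout
    Scan scan = new Scan(tenantPrefix);
    scan.setFilter(new PrefixFilter(tenantPrefix));
    scan.addFamily(from);
    scan.setMaxVersions(); // copy every stored version, not just the latest
    // For a 6m -> 1m downgrade you'd bound the copy instead, e.g.:
    // scan.setTimeRange(System.currentTimeMillis() - ONE_MONTH_MS, Long.MAX_VALUE);
    ResultScanner scanner = table.getScanner(scan);
    try {
      for (Result r : scanner) {
        Put put = new Put(r.getRow());
        for (KeyValue kv : r.raw()) {
          // 2) copy cells into the 6-month CF, preserving original timestamps
          put.add(to, kv.getQualifier(), kv.getTimestamp(), kv.getValue());
        }
        table.put(put);
        // 3) purge this tenant's row from the 1-month CF
        Delete del = new Delete(r.getRow());
        del.deleteFamily(from);
        table.delete(del);
      }
    } finally {
      scanner.close();
      table.close();
    }
  }
}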
> >
> > To answer Mike's questions from his other reply:
> >
> > > What are the data access patterns? Are they discrete between tenants?
> > > As long as the data access is discrete between tenants and the tenants
> > > write to only one bucket, you can do what you suggest.
> >
> > Yes, data for a given tenant would be written to just 1 of those CFs.
> >
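
Concretely, I picture the write path like this; the retentionBucketFor() helper
and the "tenantId|eventId" row key are hypothetical:

// Fragment: write one event into the single retention-bucket CF for its tenant.
// retentionBucketFor() and the row-key layout are hypothetical, not settled design.
void write(HTable table, String tenantId, String eventId, byte[] value) throws IOException {
  String bucket = retentionBucketFor(tenantId); // e.g. returns "monthly"
  Put put = new Put(Bytes.toBytes(tenantId + "|" + eventId));
  put.add(Bytes.toBytes(bucket), Bytes.toBytes("payload"), value);
  table.put(put);
}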
> > > But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you
> > > attempt to get the data. This means you read from one column family when
> > > you do your get() and not all of them, right? ;-)
> >
> > Yes, when reading the data I'd know the tenant's retention policy, and based
> > on that I'd know which CF to get the data from.
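
In code, something like this (same hypothetical retentionBucketFor() helper and
row-key layout as in the write-path fragment above):

// Fragment: look up the tenant's policy first, then restrict the Get to that one CF.
Result read(HTable table, String tenantId, String eventId) throws IOException {
  Get get = new Get(Bytes.toBytes(tenantId + "|" + eventId));
  get.addFamily(Bytes.toBytes(retentionBucketFor(tenantId))); // one CF, not all of them
  return table.get(get);
}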
> >
> >
> > So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> >
> >
> > > With respect to your discussion on region splits...
> > > So you're saying that if one CF splits then all of the CFs are affected
> > > and split as well?
> >
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
> > compactions, not splits, but from what I understand flushes can trigger
> > splits because they increase the aggregate size of MapFiles, which at some
> > point causes region splitting.  Please correct me if I'm wrong. :)
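
For reference, the threshold I have in mind is hbase.hregion.max.filesize.  From
what I can tell the size check is per store (i.e. per CF), but when any one store
crosses it the whole region, all CFs included, is split:

// Fragment: reading the region split threshold.
// The fallback here matches what I believe 0.90 ships with (256MB).
Configuration conf = HBaseConfiguration.create();
long splitThreshold = conf.getLong("hbase.hregion.max.filesize", 256L * 1024 * 1024);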