HBase >> mail # user >> Suggested and max number of CFs per table


Re: Suggested and max number of CFs per table
Hi,
> Patrick raised an issue that might be of concern... region splits.

Right.  And if I understand correctly, if I want to have multiple CFs that grow
unevenly, these region splits are something I have to then be willing to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that it's a business issue that will be driving the logic.

The exact business requirement is not defined yet.  Say 3-4 retention policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may want
> to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes
> the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right.  The reason I am even thinking about CF-per-retention-policy is because I
am afraid of a big and expensive nightly scan-and-delete.  That said, I don't
actually know whether this scan will be expensive, or how expensive.  So I'm
trying to understand the pros and cons ahead of time.  Maybe I'm prematurely optimizing, but
since this feels like a big structural/architectural change, I thought it would
be worth "getting it right" before I have lots of tenants and their data in the
system.
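
To make the cost concern concrete, here is a minimal sketch of the nightly scan-and-delete that Mike suggests. It is a plain-Python model over an in-memory dict, not HBase client code; the tenant names, the retention values, and the `sweep` function are all hypothetical. The point it illustrates is that the sweeper has to visit every row, even though only some rows are expired.

```python
from datetime import datetime, timedelta

# Hypothetical retention policies in days (the thread only says "3-4 policies").
RETENTION_DAYS = {"tenant_a": 30, "tenant_b": 180}


def sweep(table, retention_days, now):
    """Scan every row and delete those older than the owning tenant's retention.

    `table` maps row key -> (tenant, written_at).
    Returns (rows scanned, rows deleted).
    """
    scanned = 0
    expired = []
    for key, (tenant, written_at) in table.items():
        scanned += 1
        if now - written_at > timedelta(days=retention_days[tenant]):
            expired.append(key)
    # Delete after the scan so the dict is not mutated while iterating.
    for key in expired:
        del table[key]
    return scanned, len(expired)


now = datetime(2011, 3, 17)
table = {
    "tenant_a/1": ("tenant_a", now - timedelta(days=45)),  # past 30-day policy
    "tenant_a/2": ("tenant_a", now - timedelta(days=5)),
    "tenant_b/1": ("tenant_b", now - timedelta(days=45)),  # within 180-day policy
}
scanned, deleted = sweep(table, RETENTION_DAYS, now)
```

In this model the scan cost grows with the total number of rows, while the delete cost grows only with the number of expired rows. Whether that is expensive on a real cluster depends on data volume and on the maintenance window mentioned above.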

Thank you everyone!

Otis
> HTH
>
> -Mike
>
>
> > Date: Thu, 17 Mar 2011 10:38:11 -0700
> > From: [EMAIL PROTECTED]
> > Subject: Re: Suggested and max number of CFs per table
> > To: [EMAIL PROTECTED]
> >
> > Hi,
> >
> >
> > > Patrick,
> > >
> > > Perhaps I misunderstood Otis' design.
> > >
> > > I thought he'd create the CF based on duration.
> > > So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> > > So that you set up the table once with all CFs.
> > > Then you'd write the data to one and only one of those buckets.
> >
> > That's right.
> >
> > > The only time you'd have a problem is if you have a tenant who switches their
> > > retention policy.
> > >
> > > Although you could move data still in a CF so that you still only query one CF
> > > for data.
> >
> > That's right.  Say a tenant decides to switch from keeping his data for 1 month
> > to keeping it for 6 months.
> > Then we'd have to:
> > 1) start writing new data for this tenant to the 6-month CF
> > 2) copy this tenant's old data from 1-month CF to the 6-month CF
> > 3) purge/delete old data for this tenant from 1-month CF
> >
> > If the tenant wants to go from 6-months to 1-month then we'd additionally want
> > to limit copying in step 2) above to just the last 1 month of data and drop the
> > rest.
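
The three steps above, including the shortened-retention case, can be sketched as follows. This is a plain-Python model and not HBase client code; the CF names `"1m"` and `"6m"`, the tenant-prefixed row keys, and the `switch_policy` function are assumptions made only for illustration.

```python
from datetime import datetime, timedelta


def switch_policy(table, tenant_cf, tenant, new_cf, new_retention_days, now):
    """Move one tenant's rows to the CF that matches its new retention policy.

    `table` maps CF name -> {row key: written_at}.
    `tenant_cf` maps tenant -> the CF it currently reads and writes.
    Row keys are assumed to start with "<tenant>/" (hypothetical key design).
    """
    old_cf = tenant_cf[tenant]
    tenant_cf[tenant] = new_cf  # step 1: new writes go to the new CF

    cutoff = now - timedelta(days=new_retention_days)
    prefix = tenant + "/"
    for key in [k for k in table[old_cf] if k.startswith(prefix)]:
        written_at = table[old_cf].pop(key)  # step 3: purge from the old CF
        if written_at >= cutoff:
            # step 2: copy only rows inside the new retention window;
            # older rows are dropped (this matters when retention shrinks).
            table[new_cf][key] = written_at


now = datetime(2011, 3, 17)
table = {
    "1m": {},
    "6m": {
        "acme/1": now - timedelta(days=10),
        "acme/2": now - timedelta(days=90),
        "other/1": now - timedelta(days=90),
    },
}
tenant_cf = {"acme": "6m", "other": "6m"}

# The tenant "acme" goes from 6-month to 1-month retention.
switch_policy(table, tenant_cf, "acme", "1m", 30, now)
```

The model deletes each row as it copies it, which keeps the sketch short. A real migration would more likely copy first, verify, and purge afterwards, so that a failure partway through does not lose data.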
> >
> > To answer Mike's questions from his other reply:
> >
> > > What are the data access patterns? Are they discrete between tenants?
> > > As long as the data access is discrete between tenants and the tenants write to
> > > only one bucket, you can do what you suggest.
> >
> > Yes, data for a given tenant would be written to just 1 of those CFs.
> >
> > > But here's something to consider...
> > > You are going to want to know your tenant's retention policy before you
> > > attempt to get the data. This means you read from one column family when you do
> > > your get() and not all of them, right? ;-)
> >
> > Yes, when reading the data I'd know the tenant's retention policy and based on
> > that I'd know from which CF to get the data.
> >
> >
> > So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
> >
> >
> > > With respect to your discussion on region splits..
> > > So you're saying that if one CF splits then all of the CFs are affected and
> > > split as well?
> >
> > http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
> > compactions, not splits, but from what I understand flushes can trigger splits
> > because they increase the aggregate size of MapFiles, which at some point causes
> > Region splitting.  Please correct me if I'm wrong. :)