HBase, mail # user - Suggested and max number of CFs per table


Re: Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 17:38
Hi,
> Patrick,
>
> Perhaps I misunderstood Otis' design.
>
> I thought he'd create the CF based on duration.
> So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> So that you set up the table once with all CFs.
> Then you'd write the data to one and only one of those buckets.

That's right.

> The only time you'd have a problem is if you have a tenant who switches their
> retention policy.
>
> Although you could move data still in a CF so that you still only query one CF
> for data.

That's right.  Say a tenant decides to switch from keeping his data for 1 month
to keeping it for 6 months.
Then we'd have to:
1) start writing new data for this tenant to the 6-month CF
2) copy this tenant's old data from 1-month CF to the 6-month CF
3) purge/delete old data for this tenant from 1-month CF

If the tenant wants to go from 6 months to 1 month (so the source and target
CFs are swapped), then we'd additionally want to limit copying in step 2) above
to just the last 1 month of data and drop the rest.
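
The migration steps above can be sketched with a toy in-memory model. This is
only an illustration under stated assumptions: plain Python dicts stand in for
column families, nothing here is the HBase client API, and the names
`migrate_tenant`, `cf_1m`, `cf_6m` and `keep_since` are hypothetical.

```python
from typing import Dict, Optional, Tuple

# Toy stand-in for a column family: (tenant, timestamp) -> value.
# This is NOT the HBase client API; all names here are hypothetical.
ColumnFamily = Dict[Tuple[str, int], str]


def migrate_tenant(tenant: str, src: ColumnFamily, dst: ColumnFamily,
                   keep_since: Optional[int] = None) -> int:
    """Copy one tenant's cells from src to dst, then purge them from src.

    keep_since: if set, only cells with timestamp >= keep_since are copied
    (the "longer to shorter retention" case); older cells are dropped.
    Returns the number of cells copied.
    """
    keys = [k for k in src if k[0] == tenant]
    copied = 0
    for key in keys:
        _, ts = key
        if keep_since is None or ts >= keep_since:
            dst[key] = src[key]      # step 2: copy old data
            copied += 1
        del src[key]                 # step 3: purge from the old CF
    return copied


cf_1m: ColumnFamily = {("acme", 10): "a", ("acme", 20): "b", ("other", 15): "c"}
cf_6m: ColumnFamily = {}

# Step 1 (routing new writes to the 6-month CF) is assumed to have happened
# already, so the old CF no longer receives writes for this tenant.
copied = migrate_tenant("acme", cf_1m, cf_6m)
```

Passing `keep_since` covers the shrinking case: cells older than the cutoff are
purged from the source without being copied.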

To answer Mike's questions from his other reply:

> What are the data access patterns? Are they discrete between tenants?
> As long as the data access is discrete between tenants and the tenants write to
> only one bucket, you can do what you suggest.

Yes, data for a given tenant would be written to just 1 of those CFs.

> But here's something to consider...
> You are going to want to know your tenant's retention policy before you
> attempt to get the data. This means you read from one column family when you do
> your get() and not all of them, right? ;-)

Yes, when reading the data I'd know the tenant's retention policy and based on
that I'd know from which CF to get the data.
So my question here is: How many such CFs would it be wise to have? 2? 3? 6?
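
As a rough illustration of that read path, here is a minimal sketch, assuming a
per-tenant policy lookup exists somewhere. The bucket names, the
`TENANT_POLICY` mapping and the dict-based `table` are all hypothetical
stand-ins, not HBase API calls.

```python
# Hypothetical retention buckets, one column family per bucket.
RETENTION_TO_CF = {"1m": "cf_1m", "6m": "cf_6m", "1y": "cf_1y", "forever": "cf_forever"}

# Hypothetical per-tenant policy lookup (would live in some metadata store).
TENANT_POLICY = {"acme": "1m", "globex": "forever"}


def cf_for_tenant(tenant: str) -> str:
    """Return the single column family that holds this tenant's data."""
    return RETENTION_TO_CF[TENANT_POLICY[tenant]]


def get_row(table: dict, tenant: str, row_key: str):
    """Read a row from exactly one CF instead of touching all of them."""
    cf = cf_for_tenant(tenant)
    return table[cf].get(row_key)


# Toy table: CF name -> {row key -> value}.
table = {cf: {} for cf in RETENTION_TO_CF.values()}
table["cf_1m"]["acme/row1"] = "v1"
value = get_row(table, "acme", "acme/row1")
```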

> With respect to your discussion on region splits..
> So you're saying that if one CF splits then all of the CFs are affected and
> split as well?

http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and
compactions, not splits, but from what I understand flushes can trigger splits
because they increase the aggregate size of MapFiles, which at some point causes
Region splitting.  Please correct me if I'm wrong. :)

So this is also what I wanted to verify.  As you can imagine, there will likely
be more tenants with a 1-month data retention policy than with a 1-year or
"forever" data retention policy.  So that 1-month CF will grow much more
quickly and, if I understand the above section in the HBase book correctly, it
means that it will cause all other CFs' files to split (even if they are not
big enough yet), which means more disk and network IO.

That is, if all those CFs are in the same table.  If they are in different
tables then this would not happen?
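
The concern can be made concrete with a toy model. It rests on one modelling
assumption that should be checked against the HBase version in use: a region
is split as a whole once its largest store (one CF's files) exceeds a size
threshold. The threshold value and all names below are hypothetical.

```python
from typing import Dict, List

MAX_STORE_SIZE = 100  # hypothetical threshold, arbitrary units


def split_region(region: Dict[str, int]) -> List[Dict[str, int]]:
    """Split a region (CF name -> store size) in two if any CF is too big.

    Modelling assumption: the split decision looks at the largest store, and
    the split applies to every CF in the region, however small.
    """
    if max(region.values()) <= MAX_STORE_SIZE:
        return [region]
    half = {cf: size // 2 for cf, size in region.items()}
    rest = {cf: size - half[cf] for cf, size in region.items()}
    return [half, rest]


# One table, two CFs: the busy 1-month CF drags the small 1-year CF along.
shared = split_region({"cf_1m": 120, "cf_1y": 4})

# Two tables, one CF each: only the busy table's region splits.
table_1m = split_region({"cf_1m": 120})
table_1y = split_region({"cf_1y": 4})
```

In this model the shared table ends up with two tiny `cf_1y` stores, while the
separate `cf_1y` table keeps a single unsplit region.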

Thanks,
Otis

> >  Date: Thu, 17 Mar 2011 11:26:35 -0400
> > Subject: Re: Suggested and max  number of CFs per table
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > CC: [EMAIL PROTECTED]
> >
> > Otis,
> >
> > Perhaps your biggest issue will be the need to  disable the table to add a
> > new CF. So effectively you need to bring down  the application to move in a
> > new tenant.
> >
> > Another thing with multiple CFs is that if one CF tends to get
> > disproportionally more data, you will get a lot of region splitting, and the
> > other CFs will have HFiles for a region that are very small.
> >
> > I think the only  reasonable use of CFs is if you really need row-level
> > atomicity across  CFs. Otherwise just use multiple tables.
> >
> >
> > On Thu, Mar  17, 2011 at 2:30 AM, Otis Gospodnetic <
> > [EMAIL PROTECTED]>  wrote:
> >
> > > Hi,
> > >
> > > My Q is around the  suggested or maximum number of CFs per table (see
> > > http://hbase.apache.org/book/schema.html#number.of.cfs )
> > >
> > > Consider the following use-case.
> > > * A multi-tenant  system.
> > > * All tenants write data to the same table.
> > > *  Tenants have different data retention policies.
> > >
> > > For  the above use case I thought one could then just have different CFs