-
Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 06:30
Hi,
My Q is around the suggested or maximum number of CFs per table (see http://hbase.apache.org/book/schema.html#number.of.cfs )

Consider the following use-case.
* A multi-tenant system.
* All tenants write data to the same table.
* Tenants have different data retention policies.

For the above use case I thought one could then just have different CFs with different TTLs because Stack suggested relying on HBase's ability to purge old rows by applying CF-specific TTLs: http://search-hadoop.com/m/VAeb52cvWHV. These CFs would have the same set of columns, just different TTLs. Then tenants who want to keep only the last 1 month's worth of data go to the CF where TTL=1 month, tenants who want to keep the last 6 months of data go to the CF where TTL=6 months, and so on. However, tenants are not going to be evenly distributed - there will be more tenants with shorter data retention periods, which means the CFs where these tenants have their data will grow faster.

If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs correctly, the advice is not to have more than 2-3 CFs per table? And what happens if I have say 6 CFs per table?

Again if I read the above page correctly, the problem is that uneven data distribution will mean that whenever 1 of my CFs needs to be flushed, the remaining 5 CFs will also get flushed at the same time, and this may (or will?) trigger compaction for all CFs' files creating a sudden IO hit?

Is there a good solution for this problem? Should one then have 6 different tables, each with just 1 CF instead of having 1 table with 6 CFs?

Thanks,
Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/
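The CF-per-retention-policy idea in this message can be sketched in plain Python. This is only an illustration and not HBase client code: the family names (`m1`, `m6`, `y1`), the tenant names, and the helper functions are all hypothetical. The one HBase-specific fact it relies on is that a column family's TTL is expressed in seconds.

```python
# Hypothetical retention buckets: one column family per retention policy.
# Family names and durations are illustrative, not taken from the thread.
SECONDS_PER_DAY = 24 * 60 * 60

RETENTION_BUCKETS = {
    "m1": 30 * SECONDS_PER_DAY,   # roughly 1 month
    "m6": 180 * SECONDS_PER_DAY,  # roughly 6 months
    "y1": 365 * SECONDS_PER_DAY,  # roughly 1 year
}

# Hypothetical tenant-to-policy lookup; a real system would keep this
# in its own metadata store.
TENANT_POLICY = {"acme": "m1", "globex": "m6", "initech": "m1"}


def family_for(tenant):
    """Return the single column family a tenant reads from and writes to."""
    return TENANT_POLICY[tenant]


def ttl_seconds(family):
    """Return the TTL, in seconds, that the given family would be created with."""
    return RETENTION_BUCKETS[family]
```

With this layout every read and write for a tenant touches exactly one family, and the skew described above shows up directly: two of the three example tenants land in `m1`.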
-
RE: Suggested and max number of CFs per table
Michael Segel 2011-03-17, 15:10
Otis, you sure are busy blogging. ;-)

OK, but to answer your question: you want as few column families as possible.

When we first started looking at HBase, we tried to view the column families as if they were relational tables and the key was a foreign key joining the two tables. (It's actually not a bad way for RDBMS data modelers to look at a column-oriented database for the first time.) The trouble is that when you take someone who follows third-normal-form design, you end up reading from two or more column families at the same time. This is where your problems begin: the data is actually stored in separate files, so you take a performance hit.

With respect to your example: what are the data access patterns? Are they discrete between tenants? As long as the data access is discrete between tenants and the tenants write to only one bucket, you can do what you suggest.

But here's something to consider: you are going to want to know your tenant's retention policy before you attempt to get the data. This means you read from one column family when you do your get(), and not from all of them, right? ;-)

HTH
-Mike
-
Re: Suggested and max number of CFs per table
Patrick Angeles 2011-03-17, 15:26
Otis,
Perhaps your biggest issue will be the need to disable the table to add a new CF. So effectively you need to bring down the application to move in a new tenant.

Another thing with multiple CFs is that if one CF tends to get disproportionately more data, you will get a lot of region splitting, and the other CFs will have HFiles for a region that are very small.

I think the only reasonable use of multiple CFs is if you really need row-level atomicity across CFs. Otherwise just use multiple tables.
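The point about small HFiles can be made concrete with a little arithmetic. The sketch below rests on a loudly simplified model that is an assumption, not something stated in the thread: a region splits once its largest CF reaches a size threshold, and each CF's data is then divided evenly between the two daughter regions. The function name, the percentage shares, and the 256 MB threshold are all hypothetical.

```python
def sizes_after_split(share_pct, split_threshold_mb):
    """Estimate per-CF data size (MB) in each daughter region after a split.

    share_pct maps a column family to its percentage of the region's data.
    Assumed model: the split fires when the largest CF hits the threshold,
    and every CF's data is then halved between the two daughters.
    """
    largest_pct = max(share_pct.values())
    # Total region size at the moment the largest CF reaches the threshold.
    total_mb = split_threshold_mb * 100 / largest_pct
    return {cf: total_mb * pct / 100 / 2 for cf, pct in share_pct.items()}


# Example: an 80/15/5 split of data across three retention CFs.
sizes = sizes_after_split({"m1": 80, "m6": 15, "y1": 5}, split_threshold_mb=256)
```

Under these assumptions the busy family ends up with 128 MB per daughter region while the quietest one holds only 8 MB, which is the "very small HFiles" effect described above.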
-
RE: Suggested and max number of CFs per table
Michael Segel 2011-03-17, 15:48
Patrick,

Perhaps I misunderstood Otis' design.

I thought he'd create the CFs based on duration, so you could have one CF each for (daily, weekly, monthly, annual, indefinite). You set up the table once with all CFs, and then you write the data to one and only one of those buckets.

The only time you'd have a problem is if a tenant switches their retention policy. Even then, you could move the data that is still in the old CF so that you still only query one CF for data.

With respect to your discussion on region splits: are you saying that if one CF splits, then all of the CFs are affected and split as well?

Thx
-Mike
-
Re: Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 17:38
Hi,
> Patrick,
>
> Perhaps I misunderstood Otis' design.
>
> I thought he'd create the CF based on duration.
> So you could have a CF for (daily, weekly, monthly, annual, indefinite).
> So that you set up the table once with all CFs.
> Then you'd write the data to one and only one of those buckets.

That's right.

> The only time you'd have a problem is if you have a tenant who switches their retention policy.
>
> Although you could move data still in a CF so that you still only query one CF for data.

That's right. Say a tenant decides to switch from keeping his data for 1 month to keeping it for 6 months. Then we'd have to:

1) start writing new data for this tenant to the 6-month CF
2) copy this tenant's old data from the 1-month CF to the 6-month CF
3) purge/delete old data for this tenant from the 1-month CF

If the tenant wants to go from 6 months to 1 month, then we'd additionally want to limit copying in step 2) above to just the last 1 month of data and drop the rest.

To answer Mike's questions from his other reply:

> What's the data access patterns? Are they discrete between tenants?
> As long as the data access is discrete between tenants and the tenants write to only one bucket, you can do what you suggest.

Yes, data for a given tenant would be written to just 1 of those CFs.

> But here's something to consider...
> You are going to want to know your tenant's retention policy before you attempt to get the data. This means you read from one column family when you do your get() and not all of them, right? ;-)

Yes, when reading the data I'd know the tenant's retention policy and based on that I'd know from which CF to get the data.

So my question here is: How many such CFs would it be wise to have? 2? 3? 6?

> With respect to your discussion on region splits..
> So you're saying that if one CF splits then all of the CFs are affected and split as well?

http://hbase.apache.org/book/schema.html#number.of.cfs mentions flushes and compactions, not splits, but from what I understand flushes can trigger splits because they increase the aggregate size of MapFiles, which at some point causes Region splitting. Please correct me if I'm wrong. :)

So this is also what I wanted to verify. As you can imagine, there will likely be more tenants with a 1-month data retention policy than with 1-year or "forever" data retention. So that 1-month CF will grow much more quickly, and if I understand the above section in the HBase book correctly, it means that it will cause all other CFs' files to split (even if they are not big enough yet), which means more disk and network IO.

That is, if all those CFs are in the same table. If they are in different tables then this would not happen?

Thanks,
Otis
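The three migration steps listed in this message, including the extra cutoff when a tenant moves to a shorter retention period, can be sketched as follows. This is an in-memory model rather than HBase client code: the `policy` and `cells` dictionaries, the function name, and the family names are hypothetical. A real job would also need to finish the copy before deleting anything.

```python
def switch_retention(policy, cells, tenant, new_cf, new_ttl, now):
    """Move a tenant to a different retention CF.

    policy: dict mapping tenant -> current column family
    cells:  dict mapping (tenant, column family) -> list of (timestamp, value)
    Returns (cells copied, cells dropped).
    """
    old_cf = policy[tenant]

    # Step 1: route all new writes for this tenant to the new CF.
    policy[tenant] = new_cf

    # Step 3 (done up front in this in-memory model): take the tenant's
    # data out of the old CF.
    old_cells = cells.pop((tenant, old_cf), [])

    # Step 2: copy into the new CF. When moving to a shorter retention
    # period the cutoff drops everything older than the new TTL; when moving
    # to a longer one, nothing falls before the cutoff.
    cutoff = now - new_ttl
    kept = [(ts, value) for ts, value in old_cells if ts >= cutoff]
    cells.setdefault((tenant, new_cf), []).extend(kept)

    return len(kept), len(old_cells) - len(kept)


# Example: a downgrade from a long-retention CF to a short one.
policy = {"acme": "m6"}
cells = {("acme", "m6"): [(100, "old"), (900, "recent")]}
result = switch_retention(policy, cells, "acme", "m1", new_ttl=200, now=1000)
```

The same code path covers both directions, because the cutoff is derived from the new TTL in either case.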
-
RE: Suggested and max number of CFs per table
Michael Segel 2011-03-17, 18:35
Otis,

Patrick raised an issue that might be of concern: region splits. But barring that, what makes the most sense on retention policies? The point is that it's a business issue that will be driving the logic.

Depending on a clarification from Patrick or JGray or JDCryans, you may want to consider separate tables using the same key.

You could also use a single table and run a sweeper every night that deletes the rows, and then do a major compaction after hours. (Again, you would have to account for the maintenance window.)

HTH
-Mike
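The nightly sweeper alternative amounts to a full pass over the table that selects rows older than their tenant's retention period. A minimal sketch of that selection step is below. It models rows as plain tuples, and the function name, tenants, and TTL values are hypothetical; issuing the actual deletes and the follow-up major compaction are left out.

```python
def expired_row_keys(rows, tenant_ttl, now):
    """Return the keys of rows that have outlived their tenant's retention.

    rows:       iterable of (row_key, tenant, timestamp)
    tenant_ttl: dict mapping tenant -> retention period, in the same time
                unit as the timestamps
    """
    return [
        key
        for key, tenant, ts in rows
        if now - ts > tenant_ttl[tenant]
    ]


# Example: only acme's oldest row is past its (short) retention period.
rows = [
    ("r1", "acme", 100),
    ("r2", "acme", 950),
    ("r3", "globex", 100),
]
to_delete = expired_row_keys(rows, {"acme": 200, "globex": 2000}, now=1000)
```

Note that the selection has to look at every row on every run, which is where the cost of this approach comes from.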
-
Re: Suggested and max number of CFs per table
Stack 2011-03-17, 19:05
On Wed, Mar 16, 2011 at 11:30 PM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> If I'm reading http://hbase.apache.org/book/schema.html#number.of.cfs correctly,
> the advice is not to have more than 2-3 CFs per table?
> And what happens if I have say 6 CFs per table?
>
> Again if I read the above page correctly, the problem is that uneven data
> distribution will mean that whenever 1 of my CFs needs to be flushed, the
> remaining 5 CFs will also get flushed at the same time, and this may (or will?)
> trigger compaction for all CFs' files creating a sudden IO hit?
>
> Is there a good solution for this problem?
> Should one then have 6 different tables, each with just 1 CF instead of having 1
> table with 6 CFs?

Just to say that the reason we do not handle more than 3-4 CFs in a row well is that we haven't done the work to make it work nicely. As is, we do dumb stuff like the above-mentioned flushing of all CFs if one is at its limit even if the others are small, and we also do stuff like serialize lookups across the CFs instead of running queries in parallel if the query spans CFs (fixing this is one of the oldest issues in HBase).

St.Ack
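The flush behaviour described here ("flush all CFs if one is at limit even if others are small") can be illustrated with a toy simulation. This is a simplification and not HBase's actual implementation: the per-CF trigger, the function name, and the byte counts are assumptions made for the example.

```python
def simulate_flushes(writes, flush_limit):
    """Simulate per-CF in-memory buffers that are all flushed together.

    writes: sequence of (column family, bytes written)
    Returns a dict mapping each column family to the list of file sizes
    it flushed, in order.
    """
    memstore = {}
    files = {}
    for cf, nbytes in writes:
        memstore[cf] = memstore.get(cf, 0) + nbytes
        if memstore[cf] >= flush_limit:
            # One CF hit the limit, so every CF with buffered data is
            # flushed, however little it holds.
            for name, size in list(memstore.items()):
                if size > 0:
                    files.setdefault(name, []).append(size)
                memstore[name] = 0
    return files


# Example: a busy 1-month CF drags a nearly idle 1-year CF along with it.
flushed = simulate_flushes([("m1", 60), ("y1", 5), ("m1", 60)], flush_limit=100)
```

In the example the quiet family writes out a 5-byte file only because the busy one reached the limit, and many such tiny files are what later feed the compactions mentioned in the quoted question.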
-
Re: Suggested and max number of CFs per table
Otis Gospodnetic 2011-03-17, 21:44
Hi,
> Patrick raised an issue that might be of concern... region splits.

Right. And if I understand correctly, if I want to have multiple CFs that grow unevenly, these region splits are something I then have to be willing to accept.

> But barring that... what makes the most sense on retention policies?
>
> The point is that its a business issue that will be driving the logic.

The exact business requirement is not defined yet. Say 3-4 retention policies.

> Depending on a clarification from Patrick or JGray or JDCryans... you may want to consider separate tables using the same key.
> You could also use a single table and run a sweeper every night that deletes the rows, and then do a major compaction after hours.
> (Again you would have to account for the maintenance window.)

Right. The reason I am even thinking about CF-per-retention-policy is that I am afraid of a big and expensive nightly scan-and-delete. That said, I don't actually know whether, or how, this scan will be expensive. So I'm trying to understand the pros and cons ahead of time. Maybe I'm prematurely optimizing, but since this feels like a big structural/architectural change, I thought it would be worth "getting it right" before I have lots of tenants and their data in the system.

Thank you everyone!
Otis