|
Bryan Keller
2010-11-29, 21:40
Chen Xinli
2010-11-29, 23:58
Jonathan Gray
2010-11-30, 00:14
Bryan Keller
2010-11-30, 01:13
Jonathan Gray
2010-11-30, 01:18
Michael Segel
2010-11-30, 13:31
Jonathan Gray
2010-11-30, 16:11
Michael Segel
2010-11-30, 17:23
|
-
Schema design, one-to-many questionBryan Keller 2010-11-29, 21:40
I have read comments on modeling one-to-many relationships in HBase and
wanted to get some feedback. I have millions of customers, and each customer can make zero to thousands of orders. I want to store all of this data in HBase. The data is always accessed by customer. It seems there are a few schema design approaches. Approach 1: Orders table. One row per order. Customer data is either denormalized, or the customer ID is stored for lookup in a customer data cache. Table will have billions of rows of a few columns each. key: customer ID + order ID family 1: customer (customer:id) family 2: order (order:id, order:amount, order:date, etc.) Approach 2: Customer table. One row per customer. All orders are stored in a column family with order ID in the column name. Millions of rows with potentially thousands of columns each. key: customer ID family 1: customer (customer:id, customer:name, customer:city, etc.) family 2: order (order:id_<id of order>, order:amount_<id of order>, order:date_<id of order>) Approach 3: Same as #2, but store the order data as a serialized blob instead of in separate columns: key: customer ID family 1: customer (customer:id, customer:name, customer:city, etc.) family 2: order (order:<id of order>) Approach 4: Not sure if this is viable, but same as #2 but use versions in the order family to store multiple orders. key: customer ID family 1: customer (customer:idm customer:name, customer:city, etc.) family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions I am thinking approach #1 is probably the correct approach, but #2 and #3 (and #4?) would be more efficient from an application standpoint, as everything is processed by customer and I won't need a customer data cache or worry about updating denormalized data. Does anyone have feedback as to what approaches work for them for data sets like this, and why?
-
Re: Schema design, one-to-many questionChen Xinli 2010-11-29, 23:58
we have a ssimilar usecase, millions of user and each user with
different number of goods, from one to tens of thousands. we use approach 2 Bryan Keller <[EMAIL PROTECTED]>编写: I have read comments on modeling one-to-many relationships in HBase and wanted to get some feedback. I have millions of customers, and each customer can make zero to thousands of orders. I want to store all of this data in HBase. The data is always accessed by customer. It seems there are a few schema design approaches. Approach 1: Orders table. One row per order. Customer data is either denormalized, or the customer ID is stored for lookup in a customer data cache. Table will have billions of rows of a few columns each. key: customer ID + order ID family 1: customer (customer:id) family 2: order (order:id, order:amount, order:date, etc.) Approach 2: Customer table. One row per customer. All orders are stored in a column family with order ID in the column name. Millions of rows with potentially thousands of columns each. key: customer ID family 1: customer (customer:id, customer:name, customer:city, etc.) family 2: order (order:id_<id of order>, order:amount_<id of order>, order:date_<id of order>) Approach 3: Same as #2, but store the order data as a serialized blob instead of in separate columns: key: customer ID family 1: customer (customer:id, customer:name, customer:city, etc.) family 2: order (order:<id of order>) Approach 4: Not sure if this is viable, but same as #2 but use versions in the order family to store multiple orders. key: customer ID family 1: customer (customer:idm customer:name, customer:city, etc.) family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions I am thinking approach #1 is probably the correct approach, but #2 and #3 (and #4?) would be more efficient from an application standpoint, as everything is processed by customer and I won't need a customer data cache or worry about updating denormalized data. Does anyone have feedback as to what approaches work for them for data sets like this, and why?
-
RE: Schema design, one-to-many questionJonathan Gray 2010-11-30, 00:14
Hey Bryan,
All of these approaches could work and seem sane. My preference these days would be the wide-table approach (#2, 3, 4) rather than the tall table. Previously #1 was more efficient but in 0.90 and beyond the same optimizations exist for both tall and wide tables. For #2, I would probably structure the qualifier as <id_of_order>_fieldname (rather than the other way around). Then the fields for a given order are continuous (rather than grouping by the fieldname). If you have some existing serialization method you are using in your application, #3 would make sense. #4 wouldn't be ideal because HBase sorts on column before version, so fields for a given order would not be continuous thus reads would be inefficient. This is similar to the issue with the ordering of id/field in #2. The most important thing is to design this so you have efficient reads. I imagine one of the important queries is something like "get me all the info for this order". If so, it would be important that all fields for an order are together. JG > -----Original Message----- > From: Bryan Keller [mailto:[EMAIL PROTECTED]] > Sent: Monday, November 29, 2010 1:41 PM > To: [EMAIL PROTECTED] > Subject: Schema design, one-to-many question > > I have read comments on modeling one-to-many relationships in HBase and > wanted to get some feedback. I have millions of customers, and each > customer > can make zero to thousands of orders. I want to store all of this data in > HBase. The data is always accessed by customer. > > It seems there are a few schema design approaches. > > Approach 1: Orders table. One row per order. Customer data is either > denormalized, or the customer ID is stored for lookup in a customer data > cache. Table will have billions of rows of a few columns each. > > key: customer ID + order ID > family 1: customer (customer:id) > family 2: order (order:id, order:amount, order:date, etc.) > > Approach 2: Customer table. One row per customer. All orders are stored in > a > column family with order ID in the column name. Millions of rows with > potentially thousands of columns each. > > key: customer ID > family 1: customer (customer:id, customer:name, customer:city, etc.) > family 2: order (order:id_<id of order>, order:amount_<id of order>, > order:date_<id of order>) > > Approach 3: Same as #2, but store the order data as a serialized blob > instead of in separate columns: > > key: customer ID > family 1: customer (customer:id, customer:name, customer:city, etc.) > family 2: order (order:<id of order>) > > Approach 4: Not sure if this is viable, but same as #2 but use versions in > the order family to store multiple orders. > > key: customer ID > family 1: customer (customer:idm customer:name, customer:city, etc.) > family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions > > I am thinking approach #1 is probably the correct approach, but #2 and #3 > (and #4?) would be more efficient from an application standpoint, as > everything is processed by customer and I won't need a customer data cache > or worry about updating denormalized data. Does anyone have feedback as to > what approaches work for them for data sets like this, and why?
-
Re: Schema design, one-to-many questionBryan Keller 2010-11-30, 01:13
I am using 0.89 currently, does it include those optimizations set for 0.90? If so, great news, the wide table approach is what I preferred.
On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote: > Hey Bryan, > > All of these approaches could work and seem sane. > > My preference these days would be the wide-table approach (#2, 3, 4) rather than the tall table. Previously #1 was more efficient but in 0.90 and beyond the same optimizations exist for both tall and wide tables. > > For #2, I would probably structure the qualifier as <id_of_order>_fieldname (rather than the other way around). Then the fields for a given order are continuous (rather than grouping by the fieldname). > > If you have some existing serialization method you are using in your application, #3 would make sense. > > #4 wouldn't be ideal because HBase sorts on column before version, so fields for a given order would not be continuous thus reads would be inefficient. This is similar to the issue with the ordering of id/field in #2. > > The most important thing is to design this so you have efficient reads. I imagine one of the important queries is something like "get me all the info for this order". If so, it would be important that all fields for an order are together. > > JG > >> -----Original Message----- >> From: Bryan Keller [mailto:[EMAIL PROTECTED]] >> Sent: Monday, November 29, 2010 1:41 PM >> To: [EMAIL PROTECTED] >> Subject: Schema design, one-to-many question >> >> I have read comments on modeling one-to-many relationships in HBase and >> wanted to get some feedback. I have millions of customers, and each >> customer >> can make zero to thousands of orders. I want to store all of this data in >> HBase. The data is always accessed by customer. >> >> It seems there are a few schema design approaches. >> >> Approach 1: Orders table. One row per order. Customer data is either >> denormalized, or the customer ID is stored for lookup in a customer data >> cache. Table will have billions of rows of a few columns each. >> >> key: customer ID + order ID >> family 1: customer (customer:id) >> family 2: order (order:id, order:amount, order:date, etc.) >> >> Approach 2: Customer table. One row per customer. All orders are stored in >> a >> column family with order ID in the column name. Millions of rows with >> potentially thousands of columns each. >> >> key: customer ID >> family 1: customer (customer:id, customer:name, customer:city, etc.) >> family 2: order (order:id_<id of order>, order:amount_<id of order>, >> order:date_<id of order>) >> >> Approach 3: Same as #2, but store the order data as a serialized blob >> instead of in separate columns: >> >> key: customer ID >> family 1: customer (customer:id, customer:name, customer:city, etc.) >> family 2: order (order:<id of order>) >> >> Approach 4: Not sure if this is viable, but same as #2 but use versions in >> the order family to store multiple orders. >> >> key: customer ID >> family 1: customer (customer:idm customer:name, customer:city, etc.) >> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions >> >> I am thinking approach #1 is probably the correct approach, but #2 and #3 >> (and #4?) would be more efficient from an application standpoint, as >> everything is processed by customer and I won't need a customer data cache >> or worry about updating denormalized data. Does anyone have feedback as to >> what approaches work for them for data sets like this, and why?
-
RE: Schema design, one-to-many questionJonathan Gray 2010-11-30, 01:18
Depends on which version of 0.89. I think the September release had most if not all of them but I'm not positive.
In any case, 0.90 release is just around the corner and you'll be strongly encouraged to upgrade :) > -----Original Message----- > From: Bryan Keller [mailto:[EMAIL PROTECTED]] > Sent: Monday, November 29, 2010 5:13 PM > To: [EMAIL PROTECTED] > Subject: Re: Schema design, one-to-many question > > I am using 0.89 currently, does it include those optimizations set for > 0.90? If so, great news, the wide table approach is what I preferred. > > On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote: > > > Hey Bryan, > > > > All of these approaches could work and seem sane. > > > > My preference these days would be the wide-table approach (#2, 3, 4) > rather than the tall table. Previously #1 was more efficient but in 0.90 > and beyond the same optimizations exist for both tall and wide tables. > > > > For #2, I would probably structure the qualifier as > <id_of_order>_fieldname (rather than the other way around). Then the > fields for a given order are continuous (rather than grouping by the > fieldname). > > > > If you have some existing serialization method you are using in your > application, #3 would make sense. > > > > #4 wouldn't be ideal because HBase sorts on column before version, so > fields for a given order would not be continuous thus reads would be > inefficient. This is similar to the issue with the ordering of id/field in > #2. > > > > The most important thing is to design this so you have efficient reads. > I imagine one of the important queries is something like "get me all the > info for this order". If so, it would be important that all fields for an > order are together. > > > > JG > > > >> -----Original Message----- > >> From: Bryan Keller [mailto:[EMAIL PROTECTED]] > >> Sent: Monday, November 29, 2010 1:41 PM > >> To: [EMAIL PROTECTED] > >> Subject: Schema design, one-to-many question > >> > >> I have read comments on modeling one-to-many relationships in HBase and > >> wanted to get some feedback. I have millions of customers, and each > >> customer > >> can make zero to thousands of orders. I want to store all of this data > in > >> HBase. The data is always accessed by customer. > >> > >> It seems there are a few schema design approaches. > >> > >> Approach 1: Orders table. One row per order. Customer data is either > >> denormalized, or the customer ID is stored for lookup in a customer data > >> cache. Table will have billions of rows of a few columns each. > >> > >> key: customer ID + order ID > >> family 1: customer (customer:id) > >> family 2: order (order:id, order:amount, order:date, etc.) > >> > >> Approach 2: Customer table. One row per customer. All orders are stored > in > >> a > >> column family with order ID in the column name. Millions of rows with > >> potentially thousands of columns each. > >> > >> key: customer ID > >> family 1: customer (customer:id, customer:name, customer:city, etc.) > >> family 2: order (order:id_<id of order>, order:amount_<id of order>, > >> order:date_<id of order>) > >> > >> Approach 3: Same as #2, but store the order data as a serialized blob > >> instead of in separate columns: > >> > >> key: customer ID > >> family 1: customer (customer:id, customer:name, customer:city, etc.) > >> family 2: order (order:<id of order>) > >> > >> Approach 4: Not sure if this is viable, but same as #2 but use versions > in > >> the order family to store multiple orders. > >> > >> key: customer ID > >> family 1: customer (customer:idm customer:name, customer:city, etc.) > >> family 2: order (order:id, order:amount, order:date, etc.) - 1000 > versions > >> > >> I am thinking approach #1 is probably the correct approach, but #2 and > #3 > >> (and #4?) would be more efficient from an application standpoint, as > >> everything is processed by customer and I won't need a customer data > cache > >> or worry about updating denormalized data. Does anyone have feedback as
-
RE: Schema design, one-to-many questionMichael Segel 2010-11-30, 13:31
I'm sorry if this has already been answered, but I'll share my $0.02 anyway... First, you and everyone have to stop thinking of hbase in terms of a relational model. Because Hbase doesn't have the concept of joins, you can not think of relationships. If you have two tables where the primary key of both tables are the same (primary key == foreign key) then you can put the two relational tables within the same table but in different column families. Having said that... usually in this case you would store them in the same column family, however there may be a valid reason to separate them. (Row size and access patterns may make it a good idea to separate them out for performance reasons). In your example, which is a classic example... don't think in terms of relational but think in terms of a hierarchy data base. Think in terms of Dick Pick. (See: http://en.wikipedia.org/wiki/Pick_operating_system ) Pick or Pick like systems were Revelation (Now I'm showing my age...) , U2 aka Universe (VMark was acquired by Informix, then IBM, then spun off recently... I think) is a good example of how to model within HBase. So if we look at your example... how do you plan on accessing the information? Since everyone likes to talk 'agile' ... think about your story lines.... "A grasshopper walks in to a bar ... " (Sorry bad joke). In your case... A customer logs in to your website. Starts to place an order.... When the customer logs in... you know his customer id. So you can keep the customer information in a separate table since you're not looking up the data immediately. You can then use the customer id as part of the key for your order table. In fact, I'd make it the first part of the composite key since the customer may not always remember their order numbers and you will want to search the order table (yes, its clear that you want a table for your orders too.) So your key could be customer id + order num. Then when a customer wants to find his orders, you can do a scan with a start key and end key based on customer id. Beyond this... there are strategies on how you create your customer id to get better utilization of your cloud. NOTE THE FOLLOWING: This example and design does not use indexing. If you want 'real time' performance, you'll need to incorporate indexing in to your design, however that's another story... HTH -Mike > Date: Mon, 29 Nov 2010 13:40:58 -0800 > Subject: Schema design, one-to-many question > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > > I have read comments on modeling one-to-many relationships in HBase and > wanted to get some feedback. I have millions of customers, and each customer > can make zero to thousands of orders. I want to store all of this data in > HBase. The data is always accessed by customer. > > It seems there are a few schema design approaches. > > Approach 1: Orders table. One row per order. Customer data is either > denormalized, or the customer ID is stored for lookup in a customer data > cache. Table will have billions of rows of a few columns each. > > key: customer ID + order ID > family 1: customer (customer:id) > family 2: order (order:id, order:amount, order:date, etc.) > > Approach 2: Customer table. One row per customer. All orders are stored in a > column family with order ID in the column name. Millions of rows with > potentially thousands of columns each. > > key: customer ID > family 1: customer (customer:id, customer:name, customer:city, etc.) > family 2: order (order:id_<id of order>, order:amount_<id of order>, > order:date_<id of order>) > > Approach 3: Same as #2, but store the order data as a serialized blob > instead of in separate columns: > > key: customer ID > family 1: customer (customer:id, customer:name, customer:city, etc.) > family 2: order (order:<id of order>) > > Approach 4: Not sure if this is viable, but same as #2 but use versions in > the order family to store multiple orders. > > key: customer ID > family 1: customer (customer:idm customer:name, customer:city, etc.)
-
RE: Schema design, one-to-many questionJonathan Gray 2010-11-30, 16:11
I'm not sure I agree that "you can not think of relationships".
There is in fact a one-to-many relationship in the described schema, regardless of the database you are modeling it in. I'm not sure why you would not want to think about these relations? And a row being userid with a family containing a column per order seems like a very legitimate way to store a one-to-many relationship. However it's not exactly as you've described (primary key == foreign key). One primary key is userid, the other is orderid (or if orderid is only unique per user, the second primary key is userid+orderid). Relational thinking would be three tables (users, orders, userorders), HBase thinking would be one table keyed on user. What you need to "drop" from the relational model is the idea of breaking these things up into different tables to join them later. With column-orientation, you can have the user as the row and stuff all of his relations into that same row. JG > -----Original Message----- > From: Michael Segel [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, November 30, 2010 5:32 AM > To: [EMAIL PROTECTED] > Subject: RE: Schema design, one-to-many question > > > I'm sorry if this has already been answered, but I'll share my $0.02 > anyway... > > > First, you and everyone have to stop thinking of hbase in terms of a > relational model. Because Hbase doesn't have the concept of joins, you can > not think of relationships. > If you have two tables where the primary key of both tables are the same > (primary key == foreign key) then you can put the two relational tables > within the same table but in different column families. > > Having said that... usually in this case you would store them in the same > column family, however there may be a valid reason to separate them. (Row > size and access patterns may make it a good idea to separate them out for > performance reasons). > > In your example, which is a classic example... don't think in terms of > relational but think in terms of a hierarchy data base. Think in terms of > Dick Pick. > (See: http://en.wikipedia.org/wiki/Pick_operating_system ) > > Pick or Pick like systems were Revelation (Now I'm showing my age...) , U2 > aka Universe (VMark was acquired by Informix, then IBM, then spun off > recently... I think) is a good example of how to model within HBase. > > So if we look at your example... how do you plan on accessing the > information? > Since everyone likes to talk 'agile' ... think about your story lines.... > "A grasshopper walks in to a bar ... " (Sorry bad joke). > In your case... A customer logs in to your website. Starts to place an > order.... > When the customer logs in... you know his customer id. So you can keep the > customer information in a separate table since you're not looking up the > data immediately. You can then use the customer id as part of the key for > your order table. In fact, I'd make it the first part of the composite key > since the customer may not always remember their order numbers and you will > want to search the order table (yes, its clear that you want a table for > your orders too.) So your key could be customer id + order num. > > Then when a customer wants to find his orders, you can do a scan with a > start key and end key based on customer id. > > Beyond this... there are strategies on how you create your customer id to > get better utilization of your cloud. > > NOTE THE FOLLOWING: > This example and design does not use indexing. If you want 'real time' > performance, you'll need to incorporate indexing in to your design, however > that's another story... > > HTH > > -Mike > > > > Date: Mon, 29 Nov 2010 13:40:58 -0800 > > Subject: Schema design, one-to-many question > > From: [EMAIL PROTECTED] > > To: [EMAIL PROTECTED] > > > > I have read comments on modeling one-to-many relationships in HBase and > > wanted to get some feedback. I have millions of customers, and each > customer > > can make zero to thousands of orders. I want to store all of this data in
-
RE: Schema design, one-to-many questionMichael Segel 2010-11-30, 17:23
> From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > Subject: RE: Schema design, one-to-many question > Date: Tue, 30 Nov 2010 16:11:14 +0000 > > I'm not sure I agree that "you can not think of relationships". > > There is in fact a one-to-many relationship in the described schema, regardless of the database you are modeling it in. I'm not sure why you would not want to think about these relations? > > And a row being userid with a family containing a column per order seems like a very legitimate way to store a one-to-many relationship. However it's not exactly as you've described (primary key == foreign key). One primary key is userid, the other is orderid (or if orderid is only unique per user, the second primary key is userid+orderid). Relational thinking would be three tables (users, orders, userorders), HBase thinking would be one table keyed on user. > > What you need to "drop" from the relational model is the idea of breaking these things up into different tables to join them later. With column-orientation, you can have the user as the row and stuff all of his relations into that same row. > > JG Sigh. I knew this would cause trouble... :-) The point I'm trying to make is that you need to break away from thinking in terms of everything as a 3rd Normal form relational model. That's not to say that there aren't relationships between tables, or the organization, but that you're not following relational modeling. In a hierarchial ?sp? model, you do have relationships, but not necessarily in terms of a relational model. A simple example... in an order, you have a section of order detail (sku, description, qty, unit price, order amt). In a relational model, the order detail is a separate table. In a hierarchy, the order detail is actually a column of the order row. (Like I said, look at Dick Pick's stuff.) Note: There's more than one way to skin the cat... which means that you can have two models that are both correct. The key is how you access the data. In this example, you're going to hit your customer table first, then you're going to hit your order table. (Unless you're letting anyone who knows an order number look it up without authenticating the user/customer ...) ;-) Sorry but its hard to chunk out a design in quick e-mail blurbs. HTH -Mike |