HBase >> mail # user >> Schema design, one-to-many question


I have read comments on modeling one-to-many relationships in HBase and
wanted to get some feedback. I have millions of customers, and each customer
can make zero to thousands of orders. I want to store all of this data in
HBase. The data is always accessed by customer.

It seems there are a few schema design approaches.

Approach 1: Orders table. One row per order. Customer data is either
denormalized or the customer ID is stored for lookup in a customer data
cache. The table will have billions of rows with a few columns each.

key: customer ID + order ID
family 1: customer (customer:id)
family 2: order (order:id, order:amount, order:date, etc.)
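
For approach 1, the composite key would typically be built with fixed-width,
zero-padded IDs so all of a customer's orders sort contiguously and a prefix
scan on the customer ID retrieves them in order-ID order. A minimal sketch
(the key widths and "|" separator are my own assumptions, not anything
HBase-specific):

```java
public class OrderKey {
    // Composite row key "customerId|orderId" with zero-padded, fixed-width
    // IDs so keys sort lexicographically in (customer, order) order.
    static String rowKey(long customerId, long orderId) {
        return String.format("%012d|%012d", customerId, orderId);
    }

    // Start/stop keys for scanning every order of one customer.
    static String scanStart(long customerId) {
        return String.format("%012d|", customerId);
    }
    static String scanStop(long customerId) {
        return String.format("%012d|", customerId + 1);
    }

    public static void main(String[] args) {
        System.out.println(rowKey(42, 7));  // 000000000042|000000000007
        System.out.println(scanStart(42) + " .. " + scanStop(42));
    }
}
```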

Approach 2: Customer table. One row per customer. All orders are stored in a
column family with order ID in the column name. Millions of rows with
potentially thousands of columns each.

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:id_<id of order>, order:amount_<id of order>,
order:date_<id of order>)
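
For approach 2, the per-order data lives in the qualifier names themselves.
A sketch of how those qualifiers might be built (padding and naming are my
own assumptions; the TreeMap just mimics HBase's sorted qualifier layout
within a row):

```java
import java.util.Map;
import java.util.TreeMap;

public class WideRow {
    // Qualifier in the "order" family, e.g. amount_000000000007.
    static String qualifier(String attr, long orderId) {
        return String.format("%s_%012d", attr, orderId);
    }

    public static void main(String[] args) {
        Map<String, String> orderFamily = new TreeMap<>();
        long orderId = 7;
        orderFamily.put(qualifier("id", orderId), "7");
        orderFamily.put(qualifier("amount", orderId), "19.99");
        orderFamily.put(qualifier("date", orderId), "2011-03-01");
        System.out.println(orderFamily);
    }
}
```

One thing the sketch makes visible: with `amount_<id>` naming, the row sorts
by attribute first, scattering each order's columns. Putting the order ID
first (`<id>_amount`) would keep all of one order's columns adjacent, which
helps if you ever want to read a single order with a column-prefix filter.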

Approach 3: Same as #2, but store the order data as a serialized blob
instead of in separate columns:

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:<id of order>)
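
For approach 3, the whole order becomes a single cell value. In practice the
blob would more likely be Avro/Thrift/protobuf for a stable, evolvable
encoding; this stdlib-only sketch (field names and layout are my own toy
assumptions) just shows the round trip:

```java
import java.io.*;

public class OrderBlob {
    // Encode a toy order (id, amount in cents, date) into the byte[] that
    // would be stored as the value of column order:<id of order>.
    static byte[] encode(long id, long amountCents, String date) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeLong(id);
        out.writeLong(amountCents);
        out.writeUTF(date);
        out.flush();
        return bos.toByteArray();
    }

    // Read the order ID back out of the blob.
    static long decodeId(byte[] blob) throws IOException {
        return new DataInputStream(new ByteArrayInputStream(blob)).readLong();
    }

    public static void main(String[] args) throws IOException {
        byte[] blob = encode(7, 1999, "2011-03-01");
        System.out.println(blob.length + " bytes, id=" + decodeId(blob));  // prints: 28 bytes, id=7
    }
}
```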

Approach 4: Not sure if this is viable. Same as #2, but use versions in the
order family to store multiple orders.

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
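
For #4 the order family would need its max-versions setting raised well above
the default at table-creation time. A sketch with the HBase client API (not
tested here; requires the hbase-client library, a configured connection, and
a running cluster):

```java
// Sketch only: raise the retained version count on the order family so one
// cell can hold many orders. Note HBase orders versions by timestamp and
// discards the oldest beyond the limit, so this ties order identity to
// timestamps, which is fragile.
HBaseAdmin admin = new HBaseAdmin(conf);
HTableDescriptor table = new HTableDescriptor("customer");
table.addFamily(new HColumnDescriptor("customer"));
HColumnDescriptor orders = new HColumnDescriptor("order");
orders.setMaxVersions(1000);  // keep up to 1000 versions per cell
table.addFamily(orders);
admin.createTable(table);
```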

I am thinking #1 is probably the right approach, but #2 and #3 (and #4?)
would be more efficient from an application standpoint: everything is
processed by customer, so I wouldn't need a customer data cache or have to
worry about updating denormalized data. Does anyone have feedback on which of
these approaches work for them with data sets like this, and why?