Schema design, one-to-many question
I have read comments on modeling one-to-many relationships in HBase and
wanted to get some feedback. I have millions of customers, and each customer
can make zero to thousands of orders. I want to store all of this data in
HBase. The data is always accessed by customer.

It seems there are a few schema design approaches.

Approach 1: Orders table. One row per order. Customer data is either
denormalized, or the customer ID is stored for lookup in a customer data
cache. The table will have billions of rows with a few columns each.

key: customer ID + order ID
family 1: customer (customer:id)
family 2: order (order:id, order:amount, order:date, etc.)
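
For illustration, here is roughly what a write and a per-customer read
would look like under #1 with the HBase Java client (a minimal sketch; the
fixed-width zero-padding of the composite key is my own choice, any
fixed-width encoding that sorts correctly would do):

  import org.apache.hadoop.hbase.client.*;
  import org.apache.hadoop.hbase.util.Bytes;

  // One row per order; zero-padding both IDs keeps all of a
  // customer's orders contiguous and sorted in the table.
  byte[] rowKey = Bytes.toBytes(String.format("%012d%012d", customerId, orderId));
  Put put = new Put(rowKey);
  put.add(Bytes.toBytes("customer"), Bytes.toBytes("id"), Bytes.toBytes(customerId));
  put.add(Bytes.toBytes("order"), Bytes.toBytes("id"), Bytes.toBytes(orderId));
  put.add(Bytes.toBytes("order"), Bytes.toBytes("amount"), Bytes.toBytes(amount));
  table.put(put);

  // Fetching all of one customer's orders becomes a prefix scan.
  Scan scan = new Scan();
  scan.setStartRow(Bytes.toBytes(String.format("%012d", customerId)));
  scan.setStopRow(Bytes.toBytes(String.format("%012d", customerId + 1)));
  ResultScanner orders = table.getScanner(scan);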

Approach 2: Customer table. One row per customer. All orders are stored in a
column family with order ID in the column name. Millions of rows with
potentially thousands of columns each.

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:id_<id of order>, order:amount_<id of order>,
order:date_<id of order>)
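
The same write under #2 would be something like this sketch (qualifier
naming per the layout above; everything for a customer then comes back in
a single Get):

  // One row per customer; the order ID is baked into each qualifier.
  byte[] row = Bytes.toBytes(String.format("%012d", customerId));
  Put put = new Put(row);
  String sfx = "_" + orderId;
  put.add(Bytes.toBytes("order"), Bytes.toBytes("id" + sfx), Bytes.toBytes(orderId));
  put.add(Bytes.toBytes("order"), Bytes.toBytes("amount" + sfx), Bytes.toBytes(amount));
  put.add(Bytes.toBytes("order"), Bytes.toBytes("date" + sfx), Bytes.toBytes(dateMillis));
  table.put(put);

  // Customer data plus every order in one round trip.
  Result result = table.get(new Get(row));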

Approach 3: Same as #2, but store the order data as a serialized blob
instead of in separate columns:

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:<id of order>)
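
That collapses to one cell per order, e.g. (serializeOrder() is a
hypothetical helper; the serialization format, Writables, protobuf,
whatever, is up to you):

  // One cell per order; the value is the serialized order blob.
  byte[] blob = serializeOrder(order);  // hypothetical helper
  Put put = new Put(Bytes.toBytes(String.format("%012d", customerId)));
  put.add(Bytes.toBytes("order"), Bytes.toBytes(String.valueOf(orderId)), blob);
  table.put(put);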

Approach 4: Not sure if this is viable, but same as #2, using versions in
the order family to store multiple orders.

key: customer ID
family 1: customer (customer:id, customer:name, customer:city, etc.)
family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
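
If I understand versions correctly, the order family would have to be
created with a high max-versions setting, and since versions are keyed by
cell timestamp, two orders written with the same timestamp would collide.
A sketch of what I have in mind (using the order ID as an explicit
timestamp is my own assumption, not an established pattern):

  // Family created up front with something like:
  //   HColumnDescriptor fam = new HColumnDescriptor("order");
  //   fam.setMaxVersions(1000);
  Put put = new Put(Bytes.toBytes(String.format("%012d", customerId)));
  // Use the order ID as the cell timestamp so each order
  // lands in its own version of the same qualifiers.
  put.add(Bytes.toBytes("order"), Bytes.toBytes("id"), orderId, Bytes.toBytes(orderId));
  put.add(Bytes.toBytes("order"), Bytes.toBytes("amount"), orderId, Bytes.toBytes(amount));
  table.put(put);

  // Read back every version of each order column.
  Get get = new Get(Bytes.toBytes(String.format("%012d", customerId)));
  get.setMaxVersions();
  Result allOrders = table.get(get);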

I am thinking approach #1 is probably the correct approach, but #2 and #3
(and #4?) would be more efficient from an application standpoint, as
everything is processed by customer and I won't need a customer data cache
or worry about updating denormalized data. Does anyone have feedback as to
what approaches work for them for data sets like this, and why?