HBase, mail # user - Schema design, one-to-many question


Re: Schema design, one-to-many question
Bryan Keller 2010-11-30, 01:13
I am currently using 0.89; does it include those optimizations slated for 0.90? If so, great news: the wide-table approach is the one I preferred.

On Nov 29, 2010, at 4:14 PM, Jonathan Gray wrote:

> Hey Bryan,
>
> All of these approaches could work and seem sane.
>
> My preference these days would be the wide-table approach (#2, #3, #4) rather than the tall table.  Previously #1 was more efficient, but in 0.90 and beyond the same optimizations exist for both tall and wide tables.
>
> For #2, I would probably structure the qualifier as <id_of_order>_fieldname (rather than the other way around).  Then the fields for a given order are contiguous (rather than grouped by fieldname).
>
> If you have some existing serialization method you are using in your application, #3 would make sense.
>
> #4 wouldn't be ideal because HBase sorts on column before version, so the fields for a given order would not be contiguous and reads would be inefficient.  This is similar to the ordering issue with id/field in #2.
>
> The most important thing is to design this so you have efficient reads.  I imagine one of the important queries is something like "get me all the info for this order".  If so, it would be important that all fields for an order are together.
>
> JG
>
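[A quick sketch of JG's qualifier-ordering point, using plain Java string sorting to stand in for HBase's lexicographic byte-wise sort of qualifiers. The order IDs and field names are made up for illustration:]

```java
import java.util.Arrays;

public class QualifierOrder {
    public static void main(String[] args) {
        // Two orders (1001, 1002), each with fields "amount" and "date".
        // With <id_of_order>_fieldname, sorting groups each order's fields together:
        String[] idFirst = {"1002_date", "1001_amount", "1002_amount", "1001_date"};
        Arrays.sort(idFirst); // HBase stores qualifiers in lexicographic order
        System.out.println(Arrays.toString(idFirst));
        // [1001_amount, 1001_date, 1002_amount, 1002_date]  <- order 1001 contiguous

        // With fieldname_<id_of_order>, each order's fields are scattered:
        String[] fieldFirst = {"date_1002", "amount_1001", "amount_1002", "date_1001"};
        Arrays.sort(fieldFirst);
        System.out.println(Arrays.toString(fieldFirst));
        // [amount_1001, amount_1002, date_1001, date_1002]  <- order 1001 split up
    }
}
```

[ASCII strings sort the same way as their UTF-8 bytes, so the String sort here mirrors what HBase does with the raw qualifier bytes.]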
>> -----Original Message-----
>> From: Bryan Keller [mailto:[EMAIL PROTECTED]]
>> Sent: Monday, November 29, 2010 1:41 PM
>> To: [EMAIL PROTECTED]
>> Subject: Schema design, one-to-many question
>>
>> I have read comments on modeling one-to-many relationships in HBase and
>> wanted to get some feedback. I have millions of customers, and each
>> customer
>> can make zero to thousands of orders. I want to store all of this data in
>> HBase. The data is always accessed by customer.
>>
>> It seems there are a few schema design approaches.
>>
>> Approach 1: Orders table. One row per order. Customer data is either
>> denormalized, or the customer ID is stored for lookup in a customer data
>> cache. Table will have billions of rows of a few columns each.
>>
>> key: customer ID + order ID
>> family 1: customer (customer:id)
>> family 2: order (order:id, order:amount, order:date, etc.)
>>
>> Approach 2: Customer table. One row per customer. All orders are stored in
>> a column family with order ID in the column name. Millions of rows with
>> potentially thousands of columns each.
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id_<id of order>, order:amount_<id of order>,
>> order:date_<id of order>)
>>
>> Approach 3: Same as #2, but store the order data as a serialized blob
>> instead of in separate columns:
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:<id of order>)
>>
>> Approach 4: Not sure if this is viable, but same as #2 but use versions in
>> the order family to store multiple orders.
>>
>> key: customer ID
>> family 1: customer (customer:id, customer:name, customer:city, etc.)
>> family 2: order (order:id, order:amount, order:date, etc.) - 1000 versions
>>
>> I am thinking approach #1 is probably the correct approach, but #2 and #3
>> (and #4?) would be more efficient from an application standpoint, as
>> everything is processed by customer and I won't need a customer data cache
>> or worry about updating denormalized data. Does anyone have feedback as to
>> what approaches work for them for data sets like this, and why?
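[The "get me all the info for this order" read pattern JG highlights can be simulated outside HBase with a sorted map standing in for one customer's wide row; the family name, qualifiers, and values below are invented for illustration:]

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class WideRowRead {
    public static void main(String[] args) {
        // A TreeMap stands in for one customer's "order" column family:
        // qualifiers kept in lexicographic order, as HBase stores them.
        TreeMap<String, String> orderFamily = new TreeMap<>();
        orderFamily.put("1001_amount", "19.99");
        orderFamily.put("1001_date", "2010-11-01");
        orderFamily.put("1002_amount", "5.00");
        orderFamily.put("1002_date", "2010-11-15");

        // Because the order ID leads the qualifier, all fields of order 1001
        // form one contiguous sub-range; a prefix read touches nothing else.
        SortedMap<String, String> order1001 =
                orderFamily.subMap("1001_", "1001_\uffff");
        System.out.println(order1001);
        // {1001_amount=19.99, 1001_date=2010-11-01}
    }
}
```

[With fieldname-first qualifiers the same query would need multiple point lookups across the row, which is the inefficiency JG describes for #4 and for the reversed ordering in #2.]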