HBase, mail # user - Embedded table data model
Re: Embedded table data model
Michael Segel 2012-07-13, 14:55
First,
A caveat... Schema design in HBase is one of the hardest things to teach/learn because it's so open. There is more than one correct answer when it comes to creating a good design...

Ian's presentation kind of tries to relate HBase schema design to relational modeling.
From past experience, I found that to be a bit confusing and somewhat limiting because it didn't allow the student to look beyond relational structures when thinking about the data. (It's really hard to get ER modelers to make the transition.)

To start with, you can't think about HBase in terms of transactions. Transactional processing doesn't exist in HBase, and what HBase has in terms of RLL (row-level locking) isn't the same as RLL in a transactional database.

Also, there are problems with the concept of column families. If the data set sizes are not roughly equal, you end up with a lot of small files for the smaller CF, because when a region splits, all CFs split. So unless you have a good reason to share the same key across multiple records, you really don't want to use CFs. (Or use them sparingly.)
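
To put rough numbers on the uneven-CF problem, here is a back-of-the-envelope sketch in Python. The family names, sizes, and region count are made up for illustration, and it assumes data is spread evenly across regions, which real tables rarely manage.

```python
def per_region_store_size_mb(cf_sizes_mb, num_regions):
    """Average size of each column family's store in one region.

    Assumes every family is split into the same number of regions
    (all CFs split together) and that data is spread evenly.
    """
    return {cf: size / num_regions for cf, size in cf_sizes_mb.items()}


# Hypothetical table: one large family and one small family on the same keys.
sizes_mb = {"orders": 100_000, "contacts": 500}

# Suppose the table has grown to 100 regions.
stores = per_region_store_size_mb(sizes_mb, num_regions=100)

for cf, mb in stores.items():
    print(f"{cf}: about {mb:.0f} MB per region")
```

With these invented figures the large family ends up with about 1 GB per region while the small one is chopped into pieces of about 5 MB each, which is the "lot of small files" effect described above.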

Note, I said records.
That is because you need to think of your row of data as a self contained record.
Think of when you go to your doctor's office and they pull out a hard copy of your medical records.
That folder (in my case, thick folder... ;-)  contains your entire patient medical history.
That folder would be analogous to an HBase record/row. As you can see, you end up tossing the relational model out the window.

Here's an example in terms of a PoS/Customer Order Entry system.

You could consider having a record of a customer's order.
Then you can have one column family for the customer's relatively static information like contacts, phone#, addresses, etc...
One column family for Orders
One column family for Invoices
One column family for Pick Slips

All based on a composite key of your customer_id and then order_num.

Since the column data is a byte array (everything is a byte array) the data stored in a column could be a primitive data type or some more complex structure.
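
As a rough illustration of the composite key, here is a minimal Python sketch that packs customer_id and order_num into a single byte array. The function names and the choice of fixed-width 64-bit integers are my assumptions, not something HBase prescribes. HBase sorts rows by comparing key bytes, so a fixed-width big-endian encoding keeps one customer's orders next to each other and in numeric order.

```python
import struct


def make_row_key(customer_id: int, order_num: int) -> bytes:
    """Pack both ids as big-endian unsigned 64-bit integers (16 bytes total)."""
    return struct.pack(">QQ", customer_id, order_num)


def parse_row_key(key: bytes) -> tuple:
    """Inverse of make_row_key: returns (customer_id, order_num)."""
    return struct.unpack(">QQ", key)


def customer_prefix(customer_id: int) -> bytes:
    """Leading 8 bytes shared by every row key of one customer."""
    return struct.pack(">Q", customer_id)


# Sorting the raw bytes stands in for the order in which rows would be stored.
keys = sorted(make_row_key(c, o) for c, o in [(42, 10), (7, 300), (42, 2), (7, 1)])
ordered = [parse_row_key(k) for k in keys]

# A prefix match plays the role of a scan over one customer's orders.
orders_for_42 = [parse_row_key(k) for k in keys if k.startswith(customer_prefix(42))]
```

Had the ids been stored as decimal strings instead, "10" would sort before "2", which is why the sketch uses a fixed-width binary encoding.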

In this example, I used CF's because these are records that are tied together by the same key but serve different purposes.
When you place an order, you generate one or more pick slips.
You may also generate one or more invoices associated with that order, or with multiple orders if you allow customers to have a consolidated account and bill monthly.
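
To make the "self contained record" idea concrete, here is a small in-memory Python sketch of one such row: column family -> column qualifier -> value bytes. It does not use the HBase client at all; the family names, qualifiers like slip-001, and the JSON encoding are all hypothetical choices for illustration.

```python
import json


def encode(value) -> bytes:
    """Serialize a structure to bytes. JSON is just one possible encoding."""
    return json.dumps(value, sort_keys=True).encode("utf-8")


def decode(raw: bytes):
    return json.loads(raw.decode("utf-8"))


# One row for a single (customer_id, order_num) key.
# The families are fixed up front; qualifiers are added as data arrives.
row = {
    "customer": {"phone": b"555-0100"},  # a primitive stored as raw bytes
    "orders": {"order": encode({"sku": "A-1", "qty": 3})},
    "pick_slips": {},
    "invoices": {},
}


def put(row, family, qualifier, value: bytes):
    """Add or overwrite one column; unknown families are rejected."""
    if family not in row:
        raise KeyError(f"unknown column family: {family}")
    row[family][qualifier] = value


# One order produces two pick slips and one invoice.
put(row, "pick_slips", "slip-001", encode({"sku": "A-1", "qty": 2}))
put(row, "pick_slips", "slip-002", encode({"sku": "A-1", "qty": 1}))
put(row, "invoices", "inv-001", encode({"amount_cents": 2997}))

# Everything about the order can be read back from the one row.
picked = sum(decode(v)["qty"] for v in row["pick_slips"].values())
```

Everything about the order lives under one key, like the medical folder, and new pick slips or invoices become new columns rather than rows in another table.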

A lot of the design depends on your data and its primary use case.

As always YMMV.
On Jul 12, 2012, at 10:08 PM, Ian Varley wrote:

> Column families are not the same thing as columns. You should indeed have a small number of column families, as that article points out. Columns (aka column qualifiers) are run-time defined key/value pairs that contain the data for every row, and having large numbers of these is fine.
>
>
>
> On Jul 12, 2012, at 7:27 PM, "Cole" <[EMAIL PROTECTED]> wrote:
>
>> I think this design is questionable; please refer to
>> http://hbase.apache.org/book/number.of.cfs.html
>>
>> 2012/7/12 Ian Varley <[EMAIL PROTECTED]>
>>
>>> Yes, that's fine; you can always do a single column PUT into an existing
>>> row, in a concurrency-safe way, and the lock on the row is only held as
>>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>>> architecture, that's efficient because the PUT only goes to memory, and is
>>> merged with on-disk records at read time (until a regular flush or
>>> compaction happens).
>>>
>>> So even though you already have, say, 10K transactions in the table, it's
>>> still efficient to PUT a single new transaction in (whether that's in the
>>> middle of the sorted list of columns, at the end, etc.)
>>>
>>> Ian
>>>
>>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>>>
>>> but there are other writers inserting new transactions into the table when
>>> customers do new transactions.
>>>
>>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[EMAIL PROTECTED]> wrote: