Re: Embedded table data model
Column families are not the same thing as columns. You should indeed have a small number of column families, as that article points out. Columns (aka column qualifiers) are run-time defined key/value pairs that contain the data for every row, and having large numbers of these is fine.
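
To make the distinction concrete, here is a minimal sketch in plain Python (not the HBase client API) that models one row as a small, fixed set of column families, each holding an open-ended map of qualifiers. The `Row` class, the family names "1" and "2", and the `txn_` qualifier naming are assumptions made for illustration only.

```python
# Column families are fixed when the table is created (names assumed here).
COLUMN_FAMILIES = ("1", "2")


class Row:
    """Toy model of one row: family -> {qualifier: value}."""

    def __init__(self):
        self.cells = {family: {} for family in COLUMN_FAMILIES}

    def put(self, family, qualifier, value):
        # Families cannot be invented at write time; qualifiers can.
        if family not in self.cells:
            raise KeyError(f"unknown column family: {family!r}")
        self.cells[family][qualifier] = value

    def qualifiers(self, family):
        # Qualifiers are kept in sorted byte order; sorted() mimics that.
        return sorted(self.cells[family])


row = Row()
row.put("1", b"customer_name", b"Alice")
for i in range(10_000):
    # Ten thousand run-time qualifiers, still only two families.
    row.put("2", f"txn_{i:05d}".encode("utf-8"), b"...")
```

The row ends up with two families and 10,000 qualifiers in family "2", which is the shape the article's advice allows: few families, many columns.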

On Jul 12, 2012, at 7:27 PM, "Cole" <[EMAIL PROTECTED]> wrote:

> I think this design is questionable; please refer to
> http://hbase.apache.org/book/number.of.cfs.html
> 2012/7/12 Ian Varley <[EMAIL PROTECTED]>
>> Yes, that's fine; you can always do a single column PUT into an existing
>> row, in a concurrency-safe way, and the lock on the row is only held as
>> long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
>> architecture, that's efficient because the PUT only goes to memory, and is
>> merged with on-disk records at read time (until a regular flush or
>> compaction happens).
>> So even though you already have, say, 10K transactions in the table, it's
>> still efficient to PUT a single new transaction in (whether that's in the
>> middle of the sorted list of columns, at the end, etc.)
>> Ian
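
The write path described above can be sketched roughly as follows. This is a toy model in plain Python, not HBase code: `ToyLsmRow`, `memstore`, and `store_files` are illustrative names, and the sketch leaves out the write-ahead log, cell timestamps and versions, and compaction.

```python
class ToyLsmRow:
    """Toy model of LSM-style storage for the columns of a single row."""

    def __init__(self):
        self.memstore = {}      # recent writes, held in memory
        self.store_files = []   # immutable sorted snapshots, oldest first

    def put(self, qualifier, value):
        # A write only touches the in-memory map; no snapshot is rewritten.
        self.memstore[qualifier] = value

    def flush(self):
        # Freeze the memstore into an immutable, sorted snapshot.
        if self.memstore:
            self.store_files.append(sorted(self.memstore.items()))
            self.memstore = {}

    def read(self):
        # Merge at read time: newer sources override older ones.
        merged = {}
        for snapshot in self.store_files:
            merged.update(snapshot)
        merged.update(self.memstore)
        return sorted(merged.items())


lsm_row = ToyLsmRow()
for i in range(10_000):
    lsm_row.put(f"2012-07-10 00:00:00_TXN{i:05d}", b"old")
lsm_row.flush()

# One new column that sorts before everything already flushed.
lsm_row.put("2012-07-09 12:00:00_TXN99999", b"new")
columns = lsm_row.read()
```

After the last `put`, the flushed snapshot is untouched and the memstore holds a single entry; the new column only takes its sorted position when `read()` merges the two.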
>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>> But there are other writers inserting new transactions into the table when
>> customers make new transactions.
>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[EMAIL PROTECTED]> wrote:
>> Hi Xiaobo -
>> For HBase, this is doable; you could have a single table in HBase where
>> each row is a customer (with the customerid as the rowkey), and columns for
>> each of the 300 attributes that are directly part of the customer entity.
>> This is sparse, so you'd only take up space for the attributes that
>> actually exist for each customer.
>> You could then have (possibly in another column family, but not
>> necessarily) an additional column for each transaction, where the column
>> name is composed of a date concatenated with the transaction id, in which
>> you store the 30 attributes serialized into a single byte array in the
>> cell value. (Or, you could alternately do each attribute as its own column,
>> but there's no advantage to doing so, since presumably a transaction is
>> roughly like an immutable event that you wouldn't typically change just a
>> single attribute of.) A schema for this (if spelled out in an xml
>> representation) could be:
>> <table name="customer">
>> <key>
>>   <column name="customerid" />
>> </key>
>> <columnFamily name="1">
>>   <column name="customer_attribute_1" />
>>   <column name="customer_attribute_2" />
>>   ...
>>   <column name="customer_attribute_300" />
>> </columnFamily>
>> <columnFamily name="2">
>>   <entity name="transaction" values="serialized">
>>     <key>
>>       <column name="transaction_date" type="date" />
>>       <column name="transaction_id" />
>>     </key>
>>     <column name="transaction_attribute_1" />
>>     <column name="transaction_attribute_2" />
>>     ...
>>     <column name="transaction_attribute_30" />
>>   </entity>
>> </columnFamily>
>> </table>
>> (This isn't real HBase syntax, it's just an abstract way to show you the
>> structure.) In practice, HBase isn't doing anything "special" with the
>> entity that lives nested inside your table; it's just a matter of
>> convention, that you could "see" it that way. The customer-level attributes
>> (like, say, "customer_name" and "customer_address") would be literal column
>> names (aka column qualifiers) embedded in your code, whereas the
>> transaction-oriented columns would be created at runtime with column names
>> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
>> objects (containing the 30 attributes) serialized into a byte array.
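
As a rough illustration of the run-time column names and serialized values just described, here is a small Python sketch. The helper names are hypothetical, and JSON is only an assumed serialization format; the message does not say which encoding to use, and any stable encoding to bytes would serve.

```python
import json
from datetime import datetime


def transaction_qualifier(when, transaction_id):
    # Zero-padded timestamp first, so byte order matches chronological order.
    stamp = when.strftime("%Y-%m-%d %H:%M:%S")
    return f"{stamp}_{transaction_id}".encode("utf-8")


def serialize_attributes(attributes):
    # JSON is an assumption made for this sketch.
    return json.dumps(attributes, sort_keys=True).encode("utf-8")


def deserialize_attributes(value):
    return json.loads(value.decode("utf-8"))


# Thirty placeholder attributes standing in for a real transaction.
attributes = {f"transaction_attribute_{i}": f"value_{i}" for i in range(1, 31)}

qualifier = transaction_qualifier(datetime(2012, 7, 11, 12, 34, 56), "TXN12345")
value = serialize_attributes(attributes)

earlier = transaction_qualifier(datetime(2012, 7, 10, 9, 0, 0), "TXN00001")
```

Because the date leads the qualifier, `earlier` compares less than `qualifier` as raw bytes, so columns within a row sort by transaction time.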
>> In this scenario, you get fast access to any customer by ID, and further
>> to a range of transactions by date (using, say, a column pagination
>> filter). This would perform roughly equivalently regardless of how many
>> customers are in the table, or how many transactions exist for each