Re: Embedded table data model
Column families are not the same thing as columns. You should indeed have a small number of column families, as that article points out. Columns (aka column qualifiers) are runtime-defined key/value pairs that hold each row's data, and having large numbers of them is fine.
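
(A quick sketch to make the distinction concrete, written against the 0.94-era Java client; the table name "demo", family "d", and all values below are made up:)

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FamiliesVsColumns {
  public static void main(String[] args) throws IOException {
    Configuration conf = HBaseConfiguration.create();

    // Column families are schema: declared once, at table-creation time.
    HBaseAdmin admin = new HBaseAdmin(conf);
    HTableDescriptor desc = new HTableDescriptor("demo");
    desc.addFamily(new HColumnDescriptor("d"));  // keep the family count small
    admin.createTable(desc);

    // Column qualifiers are just bytes, invented freely at write time;
    // large numbers of distinct qualifiers per family are fine.
    HTable table = new HTable(conf, "demo");
    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("any_qualifier_chosen_at_runtime"),
            Bytes.toBytes("value"));
    table.put(put);
    table.close();
  }
}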

On Jul 12, 2012, at 7:27 PM, "Cole" <[EMAIL PROTECTED]> wrote:

> I think this design has some problems; please refer to
> http://hbase.apache.org/book/number.of.cfs.html
>
> 2012/7/12 Ian Varley <[EMAIL PROTECTED]>
>
>> Yes, that's fine; you can always do a single-column PUT into an existing
>> row in a concurrency-safe way, and the lock on the row is held only as
>> long as it takes to do that. Thanks to HBase's Log-Structured Merge-Tree
>> architecture, this is efficient: the PUT goes only to memory and is
>> merged with on-disk records at read time (until a regular flush or
>> compaction happens).
>>
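>> (For concreteness, a rough sketch with the stock 0.94-era client; the row
>> key, family name, qualifier, and payload are invented:)
>>
>> HTable table = new HTable(HBaseConfiguration.create(), "customer");
>> // Stand-in bytes for a real serialized transaction:
>> byte[] txn = Bytes.toBytes("attr1|attr2|...|attr30");
>> Put put = new Put(Bytes.toBytes("customer123"));  // an existing row
>> put.add(Bytes.toBytes("2"), Bytes.toBytes("2012-07-12 13:05:00_TXN99999"), txn);
>> table.put(put);  // atomic per row; the write lands in the MemStore
>>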
>> So even though you already have, say, 10K transactions in the table, it's
>> still efficient to PUT a single new transaction in (whether that's in the
>> middle of the sorted list of columns, at the end, etc.)
>>
>> Ian
>>
>> On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:
>>
>> But there are other writers inserting new transactions into the table as
>> customers make new transactions.
>>
>> On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[EMAIL PROTECTED]
>> <mailto:[EMAIL PROTECTED]>> wrote:
>> Hi Xiaobo -
>>
>> For HBase, this is doable; you could have a single table in HBase where
>> each row is a customer (with the customerid as the rowkey), and columns for
>> each of the 300 attributes that are directly part of the customer entity.
>> This is sparse, so you'd only take up space for the attributes that
>> actually exist for each customer.
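>>
>> (Sketched with invented attribute names, reusing the table handle from the
>> snippet above:)
>>
>> Put put = new Put(Bytes.toBytes("customer123"));
>> // Write only the attributes this customer actually has; the rest of the
>> // 300 possible columns simply don't exist for this row and cost nothing.
>> put.add(Bytes.toBytes("1"), Bytes.toBytes("customer_name"), Bytes.toBytes("Alice"));
>> put.add(Bytes.toBytes("1"), Bytes.toBytes("customer_city"), Bytes.toBytes("Austin"));
>> table.put(put);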
>>
>> You could then have (possibly in another column family, but not
>> necessarily) an additional column for each transaction, where the column
>> name is composed of a date concatenated with the transaction id, in which
>> you store the 30 attributes as serialized into a single byte array in the
>> cell value. (Or, you could alternatively store each attribute as its own
>> column, but there's no advantage to doing so, since presumably a transaction
>> is roughly an immutable event whose individual attributes you wouldn't
>> typically change.) A schema for this (if spelled out in an XML
>> representation) could be:
>>
>> <table name="customer">
>>   <key>
>>     <column name="customerid" />
>>   </key>
>>   <columnFamily name="1">
>>     <column name="customer_attribute_1" />
>>     <column name="customer_attribute_2" />
>>     ...
>>     <column name="customer_attribute_300" />
>>   </columnFamily>
>>   <columnFamily name="2">
>>     <entity name="transaction" values="serialized">
>>       <key>
>>         <column name="transaction_date" type="date" />
>>         <column name="transaction_id" />
>>       </key>
>>       <column name="transaction_attribute_1" />
>>       <column name="transaction_attribute_2" />
>>       ...
>>       <column name="transaction_attribute_30" />
>>     </entity>
>>   </columnFamily>
>> </table>
>>
>> (This isn't real HBase syntax; it's just an abstract way to show you the
>> structure.) In practice, HBase isn't doing anything "special" with the
>> entity that lives nested inside your table; it's just a matter of
>> convention that you can "see" it that way. The customer-level attributes
>> (like, say, "customer_name" and "customer_address") would be literal column
>> names (aka column qualifiers) embedded in your code, whereas the
>> transaction-oriented columns would be created at runtime with column names
>> like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
>> objects (containing the 30 attributes) serialized into a byte array.
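>>
>> (In client code that might look roughly like this; every literal below is a
>> placeholder:)
>>
>> // Runtime-built qualifier: date concatenated with the transaction id.
>> String qualifier = "2012-07-11 12:34:56" + "_" + "TXN12345";
>> // Any serialization (Writable, protobuf, Avro, ...) works for the value; a
>> // delimited string is used here only to keep the sketch self-contained.
>> byte[] value = Bytes.toBytes("attr_1|attr_2|...|attr_30");
>> Put put = new Put(Bytes.toBytes("customer123"));
>> put.add(Bytes.toBytes("2"), Bytes.toBytes(qualifier), value);
>> table.put(put);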
>>
>> In this scenario, you get fast access to any customer by ID, and further
>> to a range of transactions by date (using, say, a column pagination
>> filter). This would perform roughly equivalently regardless of how many
>> customers are in the table, or how many transactions exist for each
>> customer.
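>>
>> (For a date range specifically, a ColumnRangeFilter (in HBase 0.92+) can
>> stand in for the pagination filter; a sketch, with invented bounds:)
>>
>> Get get = new Get(Bytes.toBytes("customer123"));
>> get.addFamily(Bytes.toBytes("2"));
>> // Qualifiers begin with the date and sort lexicographically, so a column
>> // range selects exactly one month of transactions.
>> get.setFilter(new ColumnRangeFilter(
>>     Bytes.toBytes("2012-07-01"), true,
>>     Bytes.toBytes("2012-08-01"), false));
>> Result txns = table.get(get);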