HBase, mail # user - Embedded table data model


Re: Embedded table data model
Ian Varley 2012-07-13, 04:55
Yes, that's what I mean.

It is not the only way to model this, but your question was, "Can we embed the transactions inside the customer table in HBase?"

On Jul 12, 2012, at 8:21 PM, "Xiaobo Gu" <[EMAIL PROTECTED]> wrote:

Hi Ian,

Do you mean each transaction will be created as a column inside the cf
for transactions, and these columns are created dynamically as
transactions occur?

Regards,

Xiaobo Gu

On Fri, Jul 13, 2012 at 11:08 AM, Ian Varley <[EMAIL PROTECTED]> wrote:
Column families are not the same thing as columns. You should indeed have a small number of column families, as that article points out. Columns (aka column qualifiers) are run-time defined key/value pairs that contain the data for every row, and having large numbers of these is fine.
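
The distinction can be pictured with a toy in-memory model. This is only a sketch in plain Python, not the HBase client API; the family names "c" and "t" and the `put` helper are hypothetical. The set of column families is fixed up front, while qualifiers are invented at write time:

```python
# Toy in-memory model of one HBase table; this is NOT the HBase client API.
# Column families are fixed when the table is created; qualifiers are not.
COLUMN_FAMILIES = ("c", "t")  # hypothetical names: customer attributes, transactions

table = {}  # rowkey -> family -> qualifier -> value


def put(rowkey, family, qualifier, value):
    """Store one cell; any qualifier is accepted, but the family must exist."""
    if family not in COLUMN_FAMILIES:
        raise KeyError(f"unknown column family: {family}")
    row = table.setdefault(rowkey, {cf: {} for cf in COLUMN_FAMILIES})
    row[family][qualifier] = value


put("cust42", "c", "customer_name", b"Alice")
for i in range(1000):
    # One new qualifier per transaction, created when the write happens.
    put("cust42", "t", f"2012-07-11 12:34:56_TXN{i:05d}", b"...")
```

Two families hold 1001 columns here; the number of families stays small no matter how many transactions arrive.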

On Jul 12, 2012, at 7:27 PM, "Cole" <[EMAIL PROTECTED]> wrote:

I think this design has some problems; please refer to
http://hbase.apache.org/book/number.of.cfs.html

2012/7/12 Ian Varley <[EMAIL PROTECTED]>

Yes, that's fine; you can always do a single column PUT into an existing
row, in a concurrency-safe way, and the lock on the row is only held as
long as it takes to do that. Because of HBase's Log-Structured Merge-Tree
architecture, this is efficient: the PUT only goes to memory, and is
merged with on-disk records at read time (until a regular flush or
compaction happens).

So even though you already have, say, 10K transactions in the table, it's
still efficient to PUT a single new transaction in (whether that's in the
middle of the sorted list of columns, at the end, etc.)
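
A rough sketch of why that holds, as a toy log-structured store in plain Python. The `ToyStore` class and its methods are hypothetical, and the sketch leaves out the write-ahead log, row locking, and compaction; it only shows that a write touches the in-memory map while a read merges memory with the flushed snapshots:

```python
class ToyStore:
    """Toy model of one row in a log-structured store (not HBase itself)."""

    def __init__(self):
        self.memstore = {}  # recent writes, held in memory
        self.flushed = []   # immutable snapshots, oldest first ("on disk")

    def put(self, qualifier, value):
        # The write only touches memory; its cost does not depend on
        # how many columns were flushed earlier.
        self.memstore[qualifier] = value

    def flush(self):
        # Move the in-memory writes into a new immutable snapshot.
        if self.memstore:
            self.flushed.append(dict(self.memstore))
            self.memstore = {}

    def read_row(self):
        # Merge at read time: newer data overrides older data.
        merged = {}
        for snapshot in self.flushed:
            merged.update(snapshot)
        merged.update(self.memstore)
        return dict(sorted(merged.items()))


store = ToyStore()
for i in range(0, 10000, 2):
    store.put(f"2012-01-01 00:00:00_TXN{i:05d}", b"old")
store.flush()

# A new column that sorts into the middle of the existing ones.
store.put("2012-01-01 00:00:00_TXN00001", b"new")
row = store.read_row()
```

The late PUT lands in `memstore` only; it shows up in sorted position when the row is read.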

Ian

On Jul 11, 2012, at 11:27 PM, Xiaobo Gu wrote:

But there are other writers inserting new transactions into the table
when customers make new transactions.

On Thu, Jul 12, 2012 at 1:13 PM, Ian Varley <[EMAIL PROTECTED]> wrote:
Hi Xiaobo -

For HBase, this is doable; you could have a single table in HBase where
each row is a customer (with the customerid as the rowkey), and columns for
each of the 300 attributes that are directly part of the customer entity.
This is sparse, so you'd only take up space for the attributes that
actually exist for each customer.

You could then have (possibly in another column family, but not
necessarily) an additional column for each transaction, where the column
name is composed of a date concatenated with the transaction id, in which
you store the 30 attributes as serialized into a single byte array in the
cell value. (Or, you could alternately do each attribute as its own column
but there's no advantage to doing so, since presumably a transaction is
roughly like an immutable event that you wouldn't typically change just a
single attribute of.) A schema for this (if spelled out in an xml
representation) could be:

<table name="customer">
<key>
 <column name="customerid" />
</key>
<columnFamily name="1">
 <column name="customer_attribute_1" />
 <column name="customer_attribute_2" />
 ...
 <column name="customer_attribute_300" />
</columnFamily>
<columnFamily name="2">
 <entity name="transaction" values="serialized">
   <key>
     <column name="transaction_date" type="date" />
     <column name="transaction_id" />
   </key>
   <column name="transaction_attribute_1" />
   <column name="transaction_attribute_2" />
   ...
   <column name="transaction_attribute_30" />
 </entity>
</columnFamily>
</table>

(This isn't real HBase syntax, it's just an abstract way to show you the
structure.) In practice, HBase isn't doing anything "special" with the
entity that lives nested inside your table; it's just a matter of
convention, that you could "see" it that way. The customer-level attributes
(like, say, "customer_name" and "customer_address") would be literal column
names (aka column qualifiers) embedded in your code, whereas the
transaction-oriented columns would be created at runtime with column names
like "2012-07-11 12:34:56_TXN12345", and values that are simply collection
objects (containing the 30 attributes) serialized into a byte array.
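
A minimal sketch of how such a qualifier and cell value might be built, in plain Python. The helper names, the choice of JSON as the serialization format, and the sample attributes are assumptions for illustration; any serialization that yields a byte array would do:

```python
import json
from datetime import datetime


def transaction_qualifier(when, txn_id):
    """Build a column qualifier: date, underscore, transaction id."""
    return f"{when:%Y-%m-%d %H:%M:%S}_{txn_id}".encode("utf-8")


def serialize_transaction(attributes):
    """Pack all transaction attributes into one byte array (JSON is an assumption)."""
    return json.dumps(attributes, sort_keys=True).encode("utf-8")


def deserialize_transaction(value):
    """Inverse of serialize_transaction."""
    return json.loads(value.decode("utf-8"))


qualifier = transaction_qualifier(datetime(2012, 7, 11, 12, 34, 56), "TXN12345")
value = serialize_transaction({"amount": "19.99", "currency": "USD"})
```

Because the date comes first and is zero-padded, the qualifiers sort chronologically as byte strings, which is what makes date-range reads within a row possible.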

In this scenario, you get fast access to any customer by ID, and further
to a range of transactions by date (using, say, a column pagination
filter). This would perform roughly equivalently regardless of how many
customers are in the table, or how many transactions exist for each
customer. What you'd lose with this design would be the ability to get a
single transaction for a single customer by ID (since you're storing them
by date). But if you need that, you could actually store it both ways. You
also might be introducing some extra contention on concurrent transaction
PUT requests for a single customer, because they'd have to fight over a lock
for the row (but that's probably not a big deal, since it's only
contentious within each customer).
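
The date-range access and the "store it both ways" idea can be sketched over a sorted list of qualifiers in plain Python. This only imitates what a column filter would do inside HBase; the function name, the sample qualifiers, and the `by_id` index are hypothetical:

```python
from bisect import bisect_left


def transactions_between(sorted_qualifiers, start, end):
    """Return qualifiers q with start <= q < end, using binary search."""
    lo = bisect_left(sorted_qualifiers, start)
    hi = bisect_left(sorted_qualifiers, end)
    return sorted_qualifiers[lo:hi]


# Columns within a row are kept sorted by qualifier.
qualifiers = sorted([
    "2012-03-01 09:00:00_TXN00001",
    "2012-07-11 12:34:56_TXN12345",
    "2012-07-12 08:15:00_TXN12346",
    "2012-11-30 23:59:59_TXN20000",
])

# All transactions in July 2012.
july = transactions_between(qualifiers, "2012-07-01", "2012-08-01")

# "Storing it both ways": a second lookup keyed by transaction id,
# here a plain dict standing in for a second set of columns.
by_id = {q.split("_", 1)[1]: q for q in qualifiers}
```

The range lookup is a binary search, so its cost grows slowly with the number of transactions in the row; the second index is what restores lookup by transaction ID.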

You might find my presentation on designing HBase schemas (from this
year's HBaseCon) useful:

http://www.hbasecon.com/sessions/hbase-schema-design-2/

Ian

On Jul 11, 2012, at 10:58 PM, Xiaobo Gu wrote:

Hi,

I have a technical problem, and wonder whether HBase or Cassandra
supports an embedded table data model, or whether somebody can show me
a way to do this:

1. We have a very large customer entity table with 100 million
rows; each customer row has about 300 attributes (columns).
2. Each customer does about 1000 transactions per year; each transaction
has about 30 attributes (columns), and we keep only one year of
transactions for each customer.

We want a data model in which a single client call, keyed by the
customer id (the primary key of the customer table), returns the
customer entity together with all the transactions that customer made
within a fixed time window. In an RDBMS we do the following: a customer
table with customerid as the primary key and a transaction table with
customer id as a secondary index, which we join; otherwise we
must make two separate calls, and b