HBase, mail # user - Custom versioning best practices


Custom versioning best practices
David Koch 2012-11-22, 13:55
Hello,

I was thinking of using versions with custom timestamps to store the
evolution of a column value, as opposed to creating several (time_t,
value_at_time_t) qualifier-value pairs. The value to be stored is a single
integer. Fast ad-hoc retrieval of multiple versions based on a row key +
filter [1] (i.e. through a web service) is important. The number of row
keys will be between 10^6 and 10^9.
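To make the two layouts concrete, here is a minimal in-memory sketch in plain Python. It is a toy model only and does not use the HBase client API: the class names, the `max_versions` parameter, and the `t%010d` qualifier naming are all assumptions made for illustration. Layout A keeps one qualifier with many versions keyed by a custom timestamp and looks up an explicit set of timestamps, roughly what the filter in [1] is described as doing. Layout B encodes the time in the qualifier name instead.

```python
from collections import defaultdict


class VersionedLayout:
    """Layout A (toy model): one qualifier, many versions keyed by timestamp."""

    def __init__(self, max_versions):
        self.max_versions = max_versions  # hypothetical per-column version cap
        self.rows = defaultdict(dict)     # row_key -> {timestamp: value}

    def put(self, row_key, timestamp, value):
        versions = self.rows[row_key]
        versions[timestamp] = value
        # Drop the oldest timestamps once the cap is exceeded.
        excess = len(versions) - self.max_versions
        if excess > 0:
            for old in sorted(versions)[:excess]:
                del versions[old]

    def get(self, row_key, timestamps):
        """Return only the versions whose timestamp was requested."""
        versions = self.rows.get(row_key, {})
        return {t: versions[t] for t in timestamps if t in versions}


class QualifierLayout:
    """Layout B (toy model): one qualifier per point in time, one version each."""

    def __init__(self):
        self.rows = defaultdict(dict)     # row_key -> {qualifier: value}

    @staticmethod
    def qualifier(time_t):
        # Hypothetical naming scheme: zero-padded so qualifiers sort by time.
        return f"t{time_t:010d}"

    def put(self, row_key, time_t, value):
        self.rows[row_key][self.qualifier(time_t)] = value

    def get(self, row_key, times):
        columns = self.rows.get(row_key, {})
        return {t: columns[self.qualifier(t)]
                for t in times if self.qualifier(t) in columns}


a = VersionedLayout(max_versions=3)
b = QualifierLayout()
for day, value in [(1, 10), (2, 12), (3, 9), (4, 15)]:
    a.put("row-1", day, value)
    b.put("row-1", day, value)

print(a.get("row-1", [1, 3, 4]))
print(b.get("row-1", [1, 3, 4]))
```

In this model the only behavioural difference is the version cap: with a cap of 3, layout A no longer has day 1 after the fourth write, while layout B keeps every qualifier until it is deleted explicitly. Real read performance depends on how HBase stores and scans the cells, which this sketch does not attempt to capture.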

a) If the number of versions (timestamps) is moderate, can I expect
read/filtering performance to be better than when using multiple
qualifier/value pairs?
b) For a larger number of versions, say 365, what precautions, if any,
should I take with respect to the HBase/table setup?
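
For a sense of scale, the figures above can be multiplied out. The sketch below is back-of-the-envelope arithmetic only, assuming every row holds the full 365 versions and that the integer is stored as a 4-byte value (both assumptions, as are the names `VERSIONS_PER_ROW`, `VALUE_BYTES` and `total_cells`). It counts raw value bytes and ignores the key that is stored alongside each cell, so actual storage would be larger.

```python
VERSIONS_PER_ROW = 365  # from question b)
VALUE_BYTES = 4         # assumption: the integer is stored as 4 bytes


def total_cells(rows, versions=VERSIONS_PER_ROW):
    """Number of cells if every row holds the full number of versions."""
    return rows * versions


for rows in (10**6, 10**9):
    cells = total_cells(rows)
    raw_gb = cells * VALUE_BYTES / 1e9
    print(f"{rows:.0e} rows -> {cells:.2e} cells, {raw_gb:.2f} GB of raw values")
```

The same cell count applies to the qualifier-per-timestamp layout, so the total number of cells by itself does not favour either design.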

I looked around a bit and found the following:

The documentation [2] mentions that the maximum number of versions should
not be too high ("in the hundreds"). The O'Reilly HBase book [3], on the
other hand, mentions that Facebook use(d) versions to store inbox messages
in order. Clearly, the number of messages may grow quite large (>> 100). Is
the advice in [2] still valid with more recent versions of HBase?

Thank you,

/David

[1] http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/filter/TimestampsFilter.html
[2] http://hbase.apache.org/book/schema.versions.html
[3] 1st edition, page 384