HBase >> mail # user >> Re: How would you model this in Hbase?


Re: How would you model this in Hbase?
Alex,

This might be an interesting use of the time dimension in HBase. Every value in HBase is uniquely represented by a set of coordinates:

 - table
 - row key
 - column family
 - column qualifier
 - timestamp

So two different values can share every coordinate except the timestamp. For your example, that could be:

 - table: econ
 - row key: "indicatorABC"
 - column family: cf1
 - column qualifier: "reporting_2011-10-01"

first value:
 - timestamp: "2011-11-01 00:00:00.000"
 - value: 2

second value:
 - timestamp: "2011-12-01 00:00:00.000"
 - value: 2.5
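
To make the coordinate idea concrete, here is a minimal sketch in plain Python. It is not the HBase client API, just a dictionary that mimics the addressing scheme; the helper names `ts_ms` and `put` are made up for illustration, and it assumes timestamps are stored as epoch milliseconds in UTC.

```python
from datetime import datetime, timezone

def ts_ms(text):
    """Convert 'YYYY-MM-DD' to epoch milliseconds (UTC assumed)."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# (table, row key, column family, column qualifier) -> {timestamp: value}
cells = {}

def put(table, row, family, qualifier, timestamp, value):
    """Store one version of a cell; hypothetical helper, not an HBase call."""
    cells.setdefault((table, row, family, qualifier), {})[timestamp] = value

coords = ("econ", "indicatorABC", "cf1", "reporting_2011-10-01")
put(*coords, ts_ms("2011-11-01"), 2.0)  # first release
put(*coords, ts_ms("2011-12-01"), 2.5)  # revision released a month later

versions = cells[coords]
latest_ts = max(versions)
print(len(versions))        # 2 versions under the same four coordinates
print(versions[latest_ts])  # 2.5, the value a default read would return
```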

That is, if you load the data so that the timestamp on each value is its release date, you can model this in a natural way. By default, reads in HBase return only the latest value, but you can tell a get or a scanner to "time travel" by reporting only values as of an older date; so you could say "tell me what the data would have said on 2011-11-01".
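
The "as of" read can be sketched in plain Python as follows. This only models the selection rule (newest version at or before a cutoff); `read_as_of` and `ts_ms` are illustrative names, not HBase calls, and the cutoff is treated as inclusive here by assumption.

```python
from datetime import datetime, timezone

def ts_ms(text):
    """Convert 'YYYY-MM-DD' to epoch milliseconds (UTC assumed)."""
    dt = datetime.strptime(text, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() * 1000)

# All versions of the cell for reporting month 2011-10, keyed by release date.
versions = {ts_ms("2011-11-01"): 2.0, ts_ms("2011-12-01"): 2.5}

def read_as_of(versions, as_of):
    """Return the newest value with timestamp <= as_of, or None if none exists."""
    eligible = [ts for ts in versions if ts <= as_of]
    return versions[max(eligible)] if eligible else None

print(read_as_of(versions, ts_ms("2011-11-01")))  # 2.0 (first release)
print(read_as_of(versions, ts_ms("2011-12-15")))  # 2.5 (after the revision)
print(read_as_of(versions, ts_ms("2011-10-15")))  # None (not yet released)
```

With the real Java client, a time range on a scan has an exclusive upper bound, so an inclusive cutoff like the one above would likely need the cutoff plus one millisecond as the maximum; check the client documentation for your version.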

(Also, by default, HBase keeps only a limited number of historical versions, namely 3, but you can configure the column family to keep as many versions as you need.)
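
A small sketch of why the version limit matters for historical reads. It is a plain-Python model, not HBase itself; the cell, its five revisions, and the integer timestamps 1..5 are hypothetical. In the example from this thread each value has only two versions, so a limit of 3 would be enough; the limit bites only when a value is revised more often.

```python
def trim_versions(versions, max_versions):
    """Keep only the newest max_versions entries, as a version cap would."""
    newest = sorted(versions, reverse=True)[:max_versions]
    return {ts: versions[ts] for ts in newest}

def read_as_of(versions, as_of):
    """Return the newest value with timestamp <= as_of, or None."""
    eligible = [ts for ts in versions if ts <= as_of]
    return versions[max(eligible)] if eligible else None

# Hypothetical cell revised on five successive releases (timestamps 1..5).
all_versions = {1: 5.0, 2: 5.5, 3: 5.4, 4: 5.6, 5: 5.7}
kept = trim_versions(all_versions, 3)

print(sorted(kept))                  # [3, 4, 5]
print(read_as_of(kept, 2))           # None: the old releases are gone
print(read_as_of(all_versions, 2))   # 5.5: available when all versions are kept
```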

There are some downsides to using the time dimension explicitly like this:
 - If you backdate things and also work with deletes, you could get some weird behavior depending on when compaction runs: a delete marker hides older-timestamped values only until a major compaction removes the marker.
 - If you have lots of versions of things, the server still has to read over these when you scan, which makes things slower. (Probably doesn't apply if you only have a couple historical versions of any given value.)
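
The delete caveat can be illustrated with a toy model. This is a simplification written in plain Python under the assumption that a delete marker masks every version with a timestamp at or below its own, and that a major compaction drops both the masked versions and the marker. The class and method names are invented for this sketch.

```python
class VersionedCell:
    """Toy model of one cell: puts plus delete markers (tombstones)."""

    def __init__(self):
        self.puts = {}         # timestamp -> value
        self.tombstones = []   # timestamps of delete markers

    def put(self, ts, value):
        self.puts[ts] = value

    def delete_up_to(self, ts):
        # Marker that masks every version with timestamp <= ts.
        self.tombstones.append(ts)

    def major_compact(self):
        # Physically drop masked versions, then drop the markers themselves.
        cutoff = max(self.tombstones, default=float("-inf"))
        self.puts = {t: v for t, v in self.puts.items() if t > cutoff}
        self.tombstones = []

    def get(self):
        # Return the newest version that no marker masks.
        cutoff = max(self.tombstones, default=float("-inf"))
        visible = {t: v for t, v in self.puts.items() if t > cutoff}
        return visible[max(visible)] if visible else None


def run(compact_before_backdated_put):
    cell = VersionedCell()
    cell.put(100, "original")
    cell.delete_up_to(200)
    if compact_before_backdated_put:
        cell.major_compact()
    cell.put(150, "backdated")  # timestamp older than the delete marker
    return cell.get()

print(run(False))  # None: the marker still masks the backdated put
print(run(True))   # backdated: the marker was compacted away first
```

The client issues the same three operations in both runs; only the timing of the compaction differs, which is what makes the outcome hard to predict.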

All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. terabytes) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate, like a relational database, CouchDB, etc.

Ian

On Feb 6, 2013, at 2:24 PM, Alex Grund wrote:

Hi,

I am a newbie to NoSQL databases and I am wondering how to model a
specific case with HBase.

What I want to model is economic time series, such as the
unemployment rate in a given country.

The complicated thing is this: values of an economic time series can,
but do not have to, be revised.

An example:

Imagine the time series is published monthly, on the first day of a
month, with the value for the previous month, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

(where "release" is the date of release and "reporting" is the
month the "value" refers to. Read: On Dec 1, 2011, the
unemployment rate for Nov 2011 was reported to be "1".)

Now imagine that on every release, the value for the previous month
is revised, like this:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Read: On Oct 1, 2011, the unemployment rate was reported to be "3"
for Sep, and the revised value for Aug was reported to be "4.5".

The most recent observation (release) ex-post is:  [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Since the data is not revised more than one month back, the whole
series ex-post would look like this: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Whereas the "known-to-market" series would look like this: [2]

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

Those are the series I want to get from the db.
How would you model this with HBase? Is HBase suitable for that
application? Or, in general, a column-oriented DB?

Or is a relational approach a better fit?
Thanks!
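
For reference, the three series labelled [1], [2] and [3] in the quoted message can be derived from the raw release records with a few lines of plain Python. This is a sketch independent of any database. The function names are made up, dates are written as ISO strings so that string comparison orders them correctly, and "known-to-market" is interpreted here as the headline (newest reporting month) of each release, which is an assumption that reproduces the listing in [2].

```python
# (release date, reporting month, value), taken from the quoted message.
records = [
    ("2011-12-01", "2011-11-01", 1.0),
    ("2011-12-01", "2011-10-01", 2.5),
    ("2011-11-01", "2011-10-01", 2.0),
    ("2011-11-01", "2011-09-01", 3.5),
    ("2011-10-01", "2011-09-01", 3.0),
    ("2011-10-01", "2011-08-01", 4.5),
    ("2011-09-01", "2011-08-01", 4.0),
    ("2011-09-01", "2011-07-01", 5.5),
]

def latest_release(records):
    """[1]: every value published in the most recent release."""
    newest = max(release for release, _, _ in records)
    return {rep: val for rel, rep, val in records if rel == newest}

def known_to_market(records):
    """[2]: for each release, the value for its newest reporting month."""
    headline = {}
    for rel, rep, val in records:
        if rel not in headline or rep > headline[rel][0]:
            headline[rel] = (rep, val)
    return {rep: val for rep, val in headline.values()}

def ex_post(records):
    """[3]: for each reporting month, the value from its latest release."""
    best = {}
    for rel, rep, val in records:
        if rep not in best or rel > best[rep][0]:
            best[rep] = (rel, val)
    return {rep: val for rep, (rel, val) in best.items()}

print(latest_release(records))
print(known_to_market(records))
print(ex_post(records))
```

In the timestamp-based layout, [3] corresponds to a default read of the latest version of each cell, while [1] and [2] need the release date, which is why it has to be stored as the cell timestamp.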