Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # user - How would you model this in Hbase?


Copy link to this message
-
Re: How would you model this in Hbase?
Ian Varley 2013-02-06, 21:05
Alex,

This might be an interesting use of the time dimension in HBase. Every value in HBase is uniquely represented by a set of coordinates:

 - table
 - row key
 - column family
 - column qualifier
 - timestamp

So, you can have two different values that have all the same coordinates, except their timestamp. So for your example, that could be:

 - table: econ
 - row key: "indicatorABC"
 - column family: cf1
 - column qualifier: "reporting_2011-10-01"

first value:
 - timestamp: "2011-11-01 00:00:00.000"
 - value: 2

second value:
 - timestamp: "2011-12-01 00:00:00.000"
 - value: 2.5

I.e., if you load the data such that the timestamps on the values represent the release date, then you can model this in a natural way. By default, reads in HBase will only give you the latest value, but you can manually tell a scanner to give you "time travel" by only reporting values as of an older date; so you could say "tell me what the data would have said on 11/01".

(Also, by default, HBase only keeps a limited number of historical versions (3), but you can tell it to keep all versions.)

There are some downsides to using the time dimension explicitly like this:
 - If you back date things and also work with deletes, you could get some weird behavior depending on when compaction runs.
 - If you have lots of versions of things, the server still has to read over these when you scan, which makes things slower. (Probably doesn't apply if you only have a couple historical versions of any given value.)

All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, couchdb, etc.

Ian

On Feb 6, 2013, at 2:24 PM, Alex Grund wrote:

Hi,

I am a newbie in nosql-databases and I am wondering how to model a
specific case with Hbase.

The thing I want to model are economic time series, such as
unemployment rate in a given country.

The complicated thing is this: Values of an economic time series can,
but do not have to be revised.

An example:

Imagine, the time series is published monthly, at the first day of a
month with the value for the previous month, such like:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

(where "release" is the date of release and "reporting" is the date of
the month the "value" refers to. Read: "On Dec 1, 2011 the
unemployement rate for Nov 2011 was reported to be "1").

Now, imagine, that on every release, the value for the previous month
is revised, such like:

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Read: On Oct, 1, 2011, the unemployment rate was reported to be "3"
for Sep, and the revised value for Aug was reported to be "4.5".

The most recent observation (release) ex-post is:  [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Since the data is not revised further than one month behind, the whole
series ex-post would look like that: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5

Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5

Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5

Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5

Whereas, the "known-to-market"-series would look like that: [2]

Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4

That are the series I want to get from the db.
How would you model this with Hbase? Is Hbase suitable for that
application? Or in general, a column oriented DB?

Or, is a a relational approach a better fit?
Thanks!