HBase, mail # user - How would you model this in Hbase?


Re: How would you model this in Hbase?
Ian Varley 2013-02-07, 13:35
Overloading the time stamp aka the versions of the cell is really not a
good idea.

I agree in general, guys (and noted the dangers in my original post). I'd note, however, that this may be one of the rare cases where this actually *isn't* overloading the timestamp. If you look at the OP's question, this really is two versions of a single value: the data originally came in as X, then a month later it was revised to Y. If the majority of queries are just going to ask "what's the latest value?", HBase makes that easy, because returning the newest version is the default behavior. And if you want to do a time travel query, that's easy too (you just set the maximum timestamp you'd like to read). Doing either of those things with the reporting_month explicitly factored into the model (in the key, say) is harder. (Not impossible, just more complicated.)
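
To make the two access patterns concrete, here is a minimal sketch in Python. It is a toy in-memory model of a single versioned cell, not the HBase client API; the class name, the integer yyyymmdd timestamps, and the initial value "2.7" are all made up for illustration (only the revised value "2.5" appears in this thread).

```python
# Toy in-memory model of one versioned cell. This is NOT the HBase client API;
# it only mimics "newest version wins" and "read as of a timestamp".
from bisect import insort


class VersionedCell:
    def __init__(self):
        self._versions = []  # (timestamp, value) pairs, kept sorted by timestamp

    def put(self, timestamp, value):
        insort(self._versions, (timestamp, value))

    def get(self, max_timestamp=None):
        """Return the newest value, or the newest one at or before max_timestamp."""
        for ts, value in reversed(self._versions):
            if max_timestamp is None or ts <= max_timestamp:
                return value
        return None


cell = VersionedCell()
cell.put(20111201, "2.7")  # hypothetical initial release (value is made up)
cell.put(20120101, "2.5")  # revision published a month later

latest = cell.get()                               # default read: newest version
as_of_december = cell.get(max_timestamp=20111215)  # "time travel" read
```

In this sketch the default read returns the revised figure, while the bounded read returns what was known in mid-December. Whether the upper bound is inclusive or exclusive in a real client is something to check in its documentation; the toy treats it as inclusive.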

In a relational database, you might model this as a simple "UPDATE econ SET value = '2.5' WHERE figure='unemployment' AND month_reporting = '2011-11-01'". The downside there is that you'd lose the old value and wouldn't be able to time travel. In HBase you can.
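
A quick way to see the lost-history problem is to run that UPDATE against an in-memory SQLite database. This is only a sketch: the table layout is guessed from the statement above, and the pre-revision value '2.7' is invented for the example.

```python
import sqlite3

# In-memory database; the schema is an assumption inferred from the UPDATE above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE econ (figure TEXT, month_reporting TEXT, value TEXT)")

# Initial release. The value '2.7' is made up for illustration.
conn.execute("INSERT INTO econ VALUES ('unemployment', '2011-11-01', '2.7')")

# The revision overwrites the row in place.
conn.execute(
    "UPDATE econ SET value = '2.5' "
    "WHERE figure = 'unemployment' AND month_reporting = '2011-11-01'"
)

rows = conn.execute(
    "SELECT value FROM econ WHERE figure = 'unemployment'"
).fetchall()
```

After the UPDATE, `rows` holds only the revised value; nothing in the table records that a different figure was ever published.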

Overloading the timestamp is a terrible idea if you make it mean something other than "the date at which this data was valid". But that's not what's happening here; that meaning is exactly what he's looking for.

Ian

On Feb 7, 2013, at 1:26 AM, Ulrich Staudinger wrote:

On 02/06/2013 01:49 PM, Michael Segel wrote:

Overloading the time stamp aka the versions of the cell is really not a
good idea.
Fully agree.

Yeah, I know opinions are like A.... everyone has one. ;-)
Yeah, but some people share one.
But you have to be aware that if someone decides to delete some data...
well one tombstone marker for the column, goodbye all of the versions of
the cell.
(Any ideas on a clean easy way to remove that tombstone?  ;-)
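
A rough sketch of that tombstone hazard, as a toy Python model rather than real HBase code: it assumes a column-level delete marker hides every version whose timestamp is at or before the marker, which is my understanding of the behavior. Class name, timestamps, and values are all hypothetical.

```python
# Toy model of a column delete marker (tombstone). NOT the HBase client API.
# Assumption: the marker masks every version with timestamp <= marker timestamp.
class Cell:
    def __init__(self):
        self.versions = {}     # timestamp -> value
        self.tombstone = None  # timestamp of the newest column delete marker

    def put(self, ts, value):
        self.versions[ts] = value

    def delete_column(self, ts):
        self.tombstone = ts if self.tombstone is None else max(self.tombstone, ts)

    def visible(self):
        """Versions a reader can still see, ordered by timestamp."""
        return {
            ts: value
            for ts, value in sorted(self.versions.items())
            if self.tombstone is None or ts > self.tombstone
        }


cell = Cell()
cell.put(20111201, "2.7")    # hypothetical original figure
cell.put(20120101, "2.5")    # hypothetical revision
cell.delete_column(20120201)  # one tombstone for the column
after_delete = cell.visible()

cell.put(20111201, "2.7")    # re-writing an old version does not bring it back
cell.put(20120301, "2.4")    # only a newer timestamp is visible again
after_rewrite = cell.visible()
```

In the toy, a single marker hides the whole history at once, and re-putting a value under an old timestamp stays hidden behind the marker.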

You're better off using other methods of adding dimension to your cell.
One that works well is using Avro.

All the usual caveats apply: don't bother with HBase unless you've got
some serious size in your data (e.g. TB) and need to support a heavy load
of real-time updates and queries. Otherwise, go with something simpler to
operate like a relational database, couchdb, etc.
While this is a valid point for just storing it and working on your own
with data, there are reasons why you want to choose a data integration
platform (more on this later).

Back to the root discussion.

Why don't you simply identify the six different types of information per
number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

the key would be:
<figure name>_<month>

et voilà.

I strongly advise against "overloading" the timestamping/versioning feature
of hbase.
You would still have to load the entire series and sort it by what you
like, but that's not a problem with hbase.
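
As a sketch of the layout described above (six pieces of information per number, keyed by `<figure name>_<month>`), here is a small Python stand-in. A plain dict plays the role of the table, and every function name, column name, date, and figure value is hypothetical.

```python
# A dict stands in for the HBase table: row key -> columns.
# Function names, column names, and all sample data are made up for illustration.
table = {}


def row_key(figure_name, month):
    return f"{figure_name}_{month}"


def put_row(figure_name, month, release_date, figure,
            revision_date=None, revised_figure=None):
    table[row_key(figure_name, month)] = {
        "figure_name": figure_name,
        "month": month,
        "release_date": release_date,
        "figure": figure,
        "revision_date": revision_date,
        "revised_figure": revised_figure,
    }


def load_series(figure_name):
    """Load every row for one figure (prefix match), then sort client-side."""
    prefix = figure_name + "_"
    rows = [cols for key, cols in table.items() if key.startswith(prefix)]
    return sorted(rows, key=lambda cols: cols["month"])


# Rows inserted out of order on purpose.
put_row("unemployment", "2011-12", "2012-01-06", "2.6")
put_row("unemployment", "2011-11", "2011-12-02", "2.7",
        revision_date="2012-01-06", revised_figure="2.5")

series = load_series("unemployment")
months = [row["month"] for row in series]
```

The revision lives in its own pair of columns here, so no cell versioning is involved; the cost is that this layout holds only one revision per reporting month unless more columns are added.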
<snip>
Thinking in ActiveQuant, you would store each of the columns above through
its IArchiveWriter. Then you can seamlessly view/chart it in the
ActiveQuant Master Server, making it available over CSV and SOAP to your
corporate intranet or to Excel through the AQ plugin.
</snip>
--
Ulrich Staudinger

http://www.activequant.org
Connect online: https://www.xing.com/profile/Ulrich_Staudinger