HBase >> mail # user >> How would you model this in Hbase?


Re: How would you model this in Hbase?
Point well taken, Ulrich - I'm not very familiar with the domain here, but what you're saying makes sense. These aren't "mistakes that are being corrected", they're really two different pieces of information, and the difference between them is interesting in and of itself. In that case, explicitly modeling it is definitely better. :)

Ian

On Feb 7, 2013, at 7:51 AM, Ulrich Staudinger wrote:

Hi there,

No offence meant, Ian. I might also be thinking in too trading-oriented a way.

You definitely want to have those numbers readily available and not as a
version. In retrospect, you will want to know by how much the actuals
were off. Or you will want to run a trading strategy against the actuals
...

It is the same with any of those macro figures.

Revised and initially reported are two separate types of information and
there is almost always a revised figure.

And when doing research, I wouldn't dare start with versioning unless it is
absolutely clear that the original value is wrong, void and worthless.

Cheers

P.S. Pardon the double post an hour ago.
On 07.02.2013 at 14:36, "Ian Varley" <[EMAIL PROTECTED]> wrote:

Overloading the time stamp aka the versions of the cell is really not a
good idea.

I agree in general, guys (and noted the dangers in my original post). I'd
note, however, that this may be one of the rare cases where this actually
*isn't* overloading the timestamp. If you look at the OP's question, this
really is two versions of a single value. The data originally came in as X,
then a month later it's revised to Y. If the majority of queries are going
to just ask "what's the latest value", then this will make it easy in
HBase, because that's the default behavior. And if you want to do a time
travel query, that too is easy (you just set the max date you'd like to
use). Doing either of those things with the reporting_month explicitly
factored into the model (in the key, say) is harder. (Not impossible, just
more complicated.)
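To make the two reads above concrete, here is a minimal sketch in plain Python. It is a toy in-memory stand-in for one versioned cell, not the HBase client API; the class name, the dates, and the values 2.4 and 2.5 are made up for illustration, and the cutoff is treated as inclusive here.

```python
from bisect import insort
from datetime import date


class VersionedCell:
    """Toy stand-in for a single cell that keeps several timestamped versions."""

    def __init__(self):
        self._versions = []  # kept sorted as (timestamp, value) pairs

    def put(self, timestamp, value):
        insort(self._versions, (timestamp, value))

    def get(self, max_timestamp=None):
        """Newest value overall, or newest value at or before max_timestamp."""
        candidates = [
            value for ts, value in self._versions
            if max_timestamp is None or ts <= max_timestamp
        ]
        return candidates[-1] if candidates else None


cell = VersionedCell()
cell.put(date(2011, 12, 2), "2.4")   # initially reported (hypothetical value)
cell.put(date(2012, 1, 6), "2.5")    # revised a month later (hypothetical value)

latest = cell.get()                        # "what's the latest value"
as_of_2011 = cell.get(date(2011, 12, 31))  # time travel: state at end of 2011
print(latest, as_of_2011)
```

The "latest" read needs no extra argument, which mirrors the default behaviour described above; the time travel read only adds a cutoff date.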

In a relational database, you might model this as a simple "UPDATE econ SET
value = '2.5' WHERE figure = 'unemployment' AND month_reporting =
'2011-11-01'". But the downside there is you'd lose the old value, and
wouldn't be able to time travel. But in HBase you can.
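A small sketch of the relational case, using Python's built-in sqlite3 module with an in-memory database. The table layout and the initial value 2.4 are assumptions made for the example; only the UPDATE statement comes from the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE econ (figure TEXT, month_reporting TEXT, value TEXT)")

# Initially reported value (hypothetical).
conn.execute("INSERT INTO econ VALUES ('unemployment', '2011-11-01', '2.4')")

# The revision arrives and overwrites the row in place.
conn.execute(
    "UPDATE econ SET value = '2.5' "
    "WHERE figure = 'unemployment' AND month_reporting = '2011-11-01'"
)

rows = conn.execute(
    "SELECT value FROM econ WHERE figure = 'unemployment'"
).fetchall()
print(rows)  # only the revised value is left; the original is gone
```

Keeping the old value in a relational schema is possible too, but it takes an extra column or a history table rather than a plain UPDATE.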

Overloading the timestamp is a terrible idea if you make it mean something
other than "the date at which this data was valid". But that's not what's
happening here: the valid date is exactly what he's looking for.

Ian

On Feb 7, 2013, at 1:26 AM, Ulrich Staudinger wrote:

On 02/06/2013 01:49 PM, Michael Segel wrote:

Overloading the time stamp aka the versions of the cell is really not a
good idea.
Fully agree.

Yeah, I know opinions are like A.... everyone has one. ;-)
Yeah, but some people share one.
But you have to be aware that if someone decides to delete some data...
well one tombstone marker for the column, goodbye all of the versions of
the cell.
(Any ideas on a clean easy way to remove that tombstone?  ;-)
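The tombstone risk can be sketched with another toy model in plain Python. This is not HBase code; it only imitates the rule that a column-level delete marker hides every version with a timestamp at or below the marker's own timestamp, until the marker itself is cleaned up. The class and method names are hypothetical, and the integer timestamps and values are made up.

```python
class CellWithTombstone:
    """Toy model: versions of one cell plus an optional column delete marker."""

    def __init__(self):
        self.versions = {}            # timestamp -> value
        self.column_tombstone = None  # timestamp of the newest delete marker

    def put(self, ts, value):
        self.versions[ts] = value

    def delete_column(self, ts):
        # Keep only the newest marker; it covers all older ones.
        if self.column_tombstone is None or ts > self.column_tombstone:
            self.column_tombstone = ts

    def visible(self):
        # A version is readable only if it is newer than the marker.
        return {
            ts: value for ts, value in self.versions.items()
            if self.column_tombstone is None or ts > self.column_tombstone
        }


cell = CellWithTombstone()
cell.put(1, "2.4")      # initially reported
cell.put(2, "2.5")      # revised
cell.delete_column(3)   # one delete marker for the column

after_delete = cell.visible()   # both versions are hidden at once
cell.put(2, "2.5")              # re-writing an old timestamp stays hidden
after_rewrite = cell.visible()
print(after_delete, after_rewrite)
```

If the initial and revised figures lived in separate columns instead, a delete on one would leave the other readable.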

You're better off using other methods of adding dimension to your cell.
One that works well is using Avro.

All the usual caveats apply: don't bother with HBase unless you've got
some serious size in your data (e.g. TB) and need to support a heavy load
of real-time updates and queries. Otherwise, go with something simpler to
operate like a relational database, couchdb, etc.
While this is a valid point for just storing it and working on your own
with data, there are reasons why you want to choose a data integration
platform (more on this later).

Back to the root discussion.

Why don't you simply identify the six different types of information per
number:

- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure

the key would be:
<figure name>_<month>

et voilà.

I strongly advise against "overloading" the timestamping/versioning feature
of HBase.
You would still have to load the entire series and sort it by what you
like, but that's not a problem with HBase.
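A rough sketch of the explicit layout proposed above, again in plain Python with a dict standing in for the table. The `<figure name>_<month>` key and the column names follow the list in this message; the helper names, the YYYY-MM month format, and all dates and figures are assumptions made for the example.

```python
from datetime import date
from decimal import Decimal

table = {}  # row key -> columns; stand-in for the real table


def row_key(figure_name, month):
    # <figure name>_<month>; YYYY-MM keeps keys of one figure in time order
    return f"{figure_name}_{month:%Y-%m}"


def store(figure_name, month, release_date, figure,
          revision_date=None, revised_figure=None):
    table[row_key(figure_name, month)] = {
        "release_date": release_date,
        "figure": figure,
        "revision_date": revision_date,
        "revised_figure": revised_figure,
    }


# Hypothetical rows: one revised month, one not (yet) revised.
store("unemployment", date(2011, 11, 1), date(2011, 12, 2), Decimal("2.4"),
      date(2012, 1, 6), Decimal("2.5"))
store("unemployment", date(2011, 10, 1), date(2011, 11, 4), Decimal("2.6"))

# Load the whole series for one figure and sort it by key.
series = sorted(
    ((key, row) for key, row in table.items()
     if key.startswith("unemployment_")),
    key=lambda item: item[0],
)

# How far off was the initial figure, where a revision exists?
revisions = {
    key: row["revised_figure"] - row["figure"]
    for key, row in series
    if row["revised_figure"] is not None
}
print(revisions)
```

Both the initial and the revised figure stay directly readable, so the size of the revision is a simple subtraction rather than a multi-version read.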
<snip>
Thinking in ActiveQuant, you would store each of the columns above through
its IArchiveWriter. Then you can seamlessly view/chart it in the
ActiveQuant Master Server, making it available over CSV and SOAP to your
corporate intranet or to Excel through the AQ plugin.
</snip>
Ulrich Staudinger

http://www.activequant.org
Connect online: https://www.xing.com/profile/Ulrich_Staudinger