-
How would you model this in Hbase?
Alex Grund 2013-02-06, 20:24
Hi,
I am new to NoSQL databases and I am wondering how to model a specific case with HBase.
What I want to model is economic time series, such as the unemployment rate in a given country.
The complicated part is this: values of an economic time series can, but do not have to, be revised.
An example:
Imagine the time series is published monthly, on the first day of each month, with the value for the previous month, like this:
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
(where "release" is the date of release and "reporting" is the month the "value" refers to. Read: "On Dec 1, 2011, the unemployment rate for Nov 2011 was reported to be 1".)
Now imagine that on every release, the value for the previous month is revised, like this:
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
Read: On Oct 1, 2011, the unemployment rate was reported to be "3" for Sep, and the revised value for Aug was reported to be "4.5".
The most recent observation (release) ex-post is: [1]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Since the data is not revised further than one month back, the whole series ex-post would look like this: [3]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/12/01; reporting: 2011/10/01; value: 2.5
Unemployment; release: 2011/11/01; reporting: 2011/09/01; value: 3.5
Unemployment; release: 2011/10/01; reporting: 2011/08/01; value: 4.5
Unemployment; release: 2011/09/01; reporting: 2011/07/01; value: 5.5
Whereas the "known-to-market" series would look like this: [2]
Unemployment; release: 2011/12/01; reporting: 2011/11/01; value: 1
Unemployment; release: 2011/11/01; reporting: 2011/10/01; value: 2
Unemployment; release: 2011/10/01; reporting: 2011/09/01; value: 3
Unemployment; release: 2011/09/01; reporting: 2011/08/01; value: 4
Those are the series I want to get from the DB. How would you model this with HBase? Is HBase suitable for that application? Or, in general, a column-oriented DB?
Or is a relational approach a better fit? Thanks!
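To make the three requested series concrete before any storage choice, here is a minimal Python sketch that derives them from the rows in the example. The function names and the tuple layout are assumptions made for illustration; "known-to-market" is read here as "the newest reporting month in each release", which is one interpretation that reproduces series [2].

```python
from datetime import date

# (release date, reporting month, value) rows copied from the example above.
RELEASES = [
    (date(2011, 12, 1), date(2011, 11, 1), 1.0),
    (date(2011, 12, 1), date(2011, 10, 1), 2.5),
    (date(2011, 11, 1), date(2011, 10, 1), 2.0),
    (date(2011, 11, 1), date(2011, 9, 1), 3.5),
    (date(2011, 10, 1), date(2011, 9, 1), 3.0),
    (date(2011, 10, 1), date(2011, 8, 1), 4.5),
    (date(2011, 9, 1), date(2011, 8, 1), 4.0),
    (date(2011, 9, 1), date(2011, 7, 1), 5.5),
]


def latest_release(rows):
    """[1] Every row published on the most recent release date."""
    newest = max(release for release, _, _ in rows)
    picked = [row for row in rows if row[0] == newest]
    return sorted(picked, key=lambda row: row[1], reverse=True)


def known_to_market(rows):
    """[2] For each release, the row for its newest reporting month."""
    headline = {}
    for release, reporting, value in rows:
        current = headline.get(release)
        if current is None or reporting > current[1]:
            headline[release] = (release, reporting, value)
    return sorted(headline.values(), key=lambda row: row[0], reverse=True)


def ex_post(rows):
    """[3] For each reporting month, the value from its latest release."""
    latest = {}
    for release, reporting, value in sorted(rows):  # ascending release date
        latest[reporting] = (release, reporting, value)  # later release wins
    return sorted(latest.values(), key=lambda row: row[1], reverse=True)


for name, series in [("[1]", latest_release(RELEASES)),
                     ("[2]", known_to_market(RELEASES)),
                     ("[3]", ex_post(RELEASES))]:
    print(name, [value for _, _, value in series])
```

Whatever data model is chosen, these three selections are the queries it has to answer cheaply.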
-
Re: How would you model this in Hbase?
Ian Varley 2013-02-06, 21:05
Alex,
This might be an interesting use of the time dimension in HBase. Every value in HBase is uniquely represented by a set of coordinates:
- table
- row key
- column family
- column qualifier
- timestamp
So, you can have two different values that have all the same coordinates, except their timestamp. So for your example, that could be:
- table: econ
- row key: "indicatorABC"
- column family: cf1
- column qualifier: "reporting_2011-10-01"
first value:
- timestamp: "2011-11-01 00:00:00.000"
- value: 2
second value:
- timestamp: "2011-12-01 00:00:00.000"
- value: 2.5
I.e., if you load the data such that the timestamps on the values represent the release date, then you can model this in a natural way. By default, reads in HBase will only give you the latest value, but you can manually tell a scanner to give you "time travel" by only reporting values as of an older date; so you could say "tell me what the data would have said on 11/01".
(Also, by default, HBase only keeps a limited number of historical versions (3), but you can tell it to keep all versions.)
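The read semantics described above can be sketched with a small in-memory stand-in. This is not the HBase client API: `VersionedCell`, `put` and `get` are hypothetical names, and the inclusive `as_of` bound is a simplification (HBase's own time-range reads treat the upper bound as exclusive, so check the client documentation for the exact behaviour of your version).

```python
from bisect import insort
from datetime import datetime


class VersionedCell:
    """Hypothetical stand-in for one cell (row, family, qualifier) with versions."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions  # assumed to be >= 1
        self._versions = []  # (timestamp, value) pairs, oldest first

    def put(self, timestamp, value):
        # A put with an existing timestamp replaces that version.
        self._versions = [(ts, v) for ts, v in self._versions if ts != timestamp]
        insort(self._versions, (timestamp, value))
        # Keep only the newest max_versions entries.
        del self._versions[:-self.max_versions]

    def get(self, as_of=None):
        """Newest value, or the newest value whose timestamp is <= as_of."""
        for ts, value in reversed(self._versions):
            if as_of is None or ts <= as_of:
                return value
        return None


cell = VersionedCell()
cell.put(datetime(2011, 11, 1), 2.0)   # first release of the Oct 2011 figure
cell.put(datetime(2011, 12, 1), 2.5)   # revision one month later

latest = cell.get()                                   # default read: newest version
time_travel = cell.get(as_of=datetime(2011, 11, 1))   # what was known on 11/01

# With a version limit of 1 the first release is no longer retrievable.
single = VersionedCell(max_versions=1)
single.put(datetime(2011, 11, 1), 2.0)
single.put(datetime(2011, 12, 1), 2.5)
lost = single.get(as_of=datetime(2011, 11, 1))
```

The last three lines show why the version limit matters for this design: if the column family keeps too few versions, the "time travel" read silently returns nothing.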
There are some downsides to using the time dimension explicitly like this:
- If you backdate things and also work with deletes, you could get some weird behavior depending on when compaction runs.
- If you have lots of versions of things, the server still has to read over these when you scan, which makes things slower. (Probably doesn't apply if you only have a couple of historical versions of any given value.)
All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, couchdb, etc.
Ian
-
Re: How would you model this in Hbase?
Michael Segel 2013-02-06, 21:49
Overloading the time stamp aka the versions of the cell is really not a good idea.
Yeah, I know opinions are like A.... everyone has one. ;-)
But you have to be aware that if someone decides to delete some data... well one tombstone marker for the column, goodbye all of the versions of the cell. (Any ideas on a clean easy way to remove that tombstone? ;-)
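The tombstone concern can be illustrated with a toy model. The sketch below assumes the commonly documented rule that a column delete marker hides every version whose timestamp is at or before the marker's timestamp, and that the marker stays in effect until a major compaction removes it. `visible_versions` is a hypothetical helper, not an HBase call, and the value 2.6 is made up for the example.

```python
from datetime import datetime


def visible_versions(puts, delete_markers):
    """Return the (timestamp, value) versions a read would still see, newest first.

    Assumption: a column delete marker masks every version whose timestamp
    is <= the marker's timestamp.
    """
    cutoff = max(delete_markers, default=None)
    return sorted(
        ((ts, value) for ts, value in puts if cutoff is None or ts > cutoff),
        reverse=True,
    )


# Versions stamped with their release dates, as in the proposal above.
puts = [(datetime(2011, 11, 1), 2.0), (datetime(2011, 12, 1), 2.5)]
before_delete = visible_versions(puts, [])

# A delete issued "now" (2013) carries a timestamp later than every release date,
# so it hides all of the backdated versions at once.
marker = datetime(2013, 2, 6)
after_delete = visible_versions(puts, [marker])

# A later put that is backdated to before the marker is hidden as well.
puts.append((datetime(2012, 1, 1), 2.6))  # hypothetical value
after_backdated_put = visible_versions(puts, [marker])
```

This is the interaction between backdated timestamps and deletes that both posts warn about: with release dates as timestamps, nearly every version sits before any delete issued at the current time.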
You're better off using other methods of adding dimension to your cell. One that works well is using Avro.
When I teach a course on HBase, I cover cells in the schema design section. I talk about the ability to use versioning as a way to add a dimension, and then tell the students that this really isn't a good idea ...
-Just saying...
The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental. Use at your own risk. Michael Segel michael_segel (AT) hotmail.com
-
Re: How would you model this in Hbase?
James Taylor 2013-02-06, 22:01
Another approach would be to use Phoenix (http://github.com/forcedotcom/phoenix). You can model your schema as you would in the relational world, but you get the horizontal scalability of HBase.
James
-
Re: How would you model this in Hbase?
Ulrich Staudinger 2013-02-07, 07:14
Why don't you simply identify the six different types of information per number:
- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure
The key would be: <figure name>_<month>, et voilà.
I strongly advise against "overloading" the timestamping/versioning feature of HBase. You would still have to load the entire series and sort it by what you like, but that's not a problem with HBase.
Thinking in ActiveQuant, you would store each of the columns above through its IArchiveWriter. Then you can seamlessly view/chart it in the ActiveQuant Master Server, making it available over CSV and SOAP to your corporate intranet.
Cheers
Ulrich Staudinger, Managing Director and Sr. Software Engineer, ActiveQuant GmbH P: +41 79 702 05 95 E: [EMAIL PROTECTED] http://www.activequant.com
AQ-R user? Join our mailing list: http://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/aqr-user
-
Re: How would you model this in Hbase?
Ulrich Staudinger 2013-02-07, 07:26
> On 02/06/2013 01:49 PM, Michael Segel wrote:
>> Overloading the time stamp aka the versions of the cell is really not a good idea.
Fully agree.
>> Yeah, I know opinions are like A.... everyone has one. ;-)
Yeah, but some people share one.
>> But you have to be aware that if someone decides to delete some data... well one tombstone marker for the column, goodbye all of the versions of the cell.
>> (Any ideas on a clean easy way to remove that tombstone? ;-)
>> You're better off using other methods of adding dimension to your cell. One that works well is using Avro.
>>> All the usual caveats apply: don't bother with HBase unless you've got some serious size in your data (e.g. TB) and need to support a heavy load of real-time updates and queries. Otherwise, go with something simpler to operate like a relational database, couchdb, etc.
While this is a valid point if you are just storing the data and working with it on your own, there are reasons why you would want to choose a data integration platform (more on this later).
Back to the root discussion. Why don't you simply identify the six different types of information per number:
- figure name (unemployment)
- month (reporting)
- release date
- figure
- revision date
- revised figure
The key would be: <figure name>_<month>, et voilà.
I strongly advise against "overloading" the timestamping/versioning feature of HBase. You would still have to load the entire series and sort it by what you like, but that's not a problem with HBase.
<snip> Thinking in ActiveQuant, you would store each of the columns above through its IArchiveWriter. Then you can seamlessly view/chart it in the ActiveQuant Master Server, making it available over CSV and SOAP to your corporate intranet or to Excel through the AQ plugin. </snip>
--
Ulrich Staudinger http://www.activequant.org
Connect online: https://www.xing.com/profile/Ulrich_Staudinger
-
Re: How would you model this in Hbase?
Ian Varley 2013-02-07, 13:35
> Overloading the time stamp aka the versions of the cell is really not a good idea.
I agree in general, guys (and noted the dangers in my original post). I'd note, however, that this may be one of the rare cases where this actually *isn't* overloading the timestamp. If you look at the OP's question, this really is two versions of a single value. The data originally came in as X, then a month later it's revised to Y. If the majority of queries are going to just ask "what's the latest value", then this will make it easy in HBase, because that's the default behavior. And if you want to do a time travel query, that too is easy (you just set the max date you'd like to use). Doing either of those things with the reporting_month explicitly factored into the model (in the key, say) is harder. (Not impossible, just more complicated.)
In a relational database, you might model this as a simple "UPDATE econ SET value = '2.5' WHERE figure = 'unemployment' AND month_reporting = '2011-10-01'". But the downside there is you'd lose the old value, and wouldn't be able to time travel. But in HBase you can.
Overloading the timestamp is a terrible idea if you make it mean something other than "the date at which this data was valid". But that's not what's happening here; that's exactly what he's looking for.
Ian
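For comparison, a relational model does not have to lose the old value if each release is inserted as its own row instead of being applied as an UPDATE. The sketch below uses Python's built-in sqlite3 purely as a stand-in; the `econ` table layout and the `value_as_of` helper are assumptions for illustration, and Phoenix has its own SQL dialect, so none of this is Phoenix syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE econ (
        figure          TEXT NOT NULL,
        month_reporting TEXT NOT NULL,  -- ISO dates sort correctly as text
        release_date    TEXT NOT NULL,
        value           REAL NOT NULL,
        PRIMARY KEY (figure, month_reporting, release_date)
    )
    """
)

# One row per release, so a revision never overwrites the earlier value.
conn.executemany(
    "INSERT INTO econ VALUES (?, ?, ?, ?)",
    [
        ("unemployment", "2011-10-01", "2011-11-01", 2.0),
        ("unemployment", "2011-10-01", "2011-12-01", 2.5),
    ],
)


def value_as_of(conn, figure, month_reporting, as_of):
    """Value of a figure for a reporting month as it was known on `as_of`."""
    row = conn.execute(
        """
        SELECT value FROM econ
        WHERE figure = ? AND month_reporting = ? AND release_date <= ?
        ORDER BY release_date DESC
        LIMIT 1
        """,
        (figure, month_reporting, as_of),
    ).fetchone()
    return row[0] if row else None


known_in_november = value_as_of(conn, "unemployment", "2011-10-01", "2011-11-15")
known_in_december = value_as_of(conn, "unemployment", "2011-10-01", "2011-12-31")
```

Putting the release date in the primary key is the relational counterpart of the timestamp coordinate: the "latest value" and "time travel" reads both become an ordered lookup on that column.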
-
Re: How would you model this in Hbase?
Ulrich Staudinger 2013-02-07, 13:51
Hi there,
No offence meant, Ian. I might also be thinking in too trading-oriented a way. You definitely want to have those numbers readily available and not as a version. In retrospect, you will want to know by how much the actuals were off. Or you will want to run a trading strategy against the actuals ...
It is the same with any of those macro figures. Revised and initially reported are two separate types of information, and there is (usually) always a revised figure. And when doing research, I wouldn't dare start with versioning unless it is absolutely clear that the original value is wrong, void and worthless.
Cheers
P.S. Pardon the double post an hour ago.
-
Re: How would you model this in Hbase?
Ian Varley 2013-02-07, 14:00
Point well taken, Ulrich - I'm not very familiar with the domain here, but what you're saying makes sense. These aren't "mistakes that are being corrected", they're really two different pieces of information, and the difference between them is interesting in and of itself. In that case, explicitly modeling it is definitely better. :)
Ian
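The explicit model the thread settles on can be sketched as one row per figure and reporting month, with the initial and revised figures as separate columns. This is a plain-dictionary illustration, not HBase or ActiveQuant code; the column names, the helper functions and the `%Y-%m` month format in the key are all assumptions layered on the `<figure name>_<month>` key suggested earlier.

```python
from datetime import date


def row_key(figure_name, month):
    """Row key in the <figure name>_<month> shape; the month format is an assumption."""
    return f"{figure_name}_{month:%Y-%m}"


def record_initial(table, figure_name, month, release_date, figure):
    """Store the initially reported figure and its release date."""
    row = table.setdefault(row_key(figure_name, month), {})
    row["release_date"] = release_date
    row["figure"] = figure


def record_revision(table, figure_name, month, revision_date, revised_figure):
    """Store the revised figure next to the initial one, never over it."""
    row = table.setdefault(row_key(figure_name, month), {})
    row["revision_date"] = revision_date
    row["revised_figure"] = revised_figure


def revision_gap(row):
    """How far the initial figure was off, or None if either value is missing."""
    if "figure" not in row or "revised_figure" not in row:
        return None
    return row["revised_figure"] - row["figure"]


table = {}
october = date(2011, 10, 1)
record_initial(table, "unemployment", october, date(2011, 11, 1), 2.0)
record_revision(table, "unemployment", october, date(2011, 12, 1), 2.5)

key = row_key("unemployment", october)
gap = revision_gap(table[key])
```

Because both figures live in named columns, the size of the revision is a one-row computation, and neither value depends on version retention or compaction.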