|
Eric Maland
2010-02-15, 05:32
Ryan Rawson
2010-02-15, 08:03
Saptarshi Guha
2010-03-04, 02:39
Stack
2010-03-04, 06:15
|
-
efficient storage of timeseriesEric Maland 2010-02-15, 05:32
Hi,
I'm pretty new to Hbase so bear with me - I am porting over a system for storage of timeseries data to both HBase and Cassandra. This is pretty straightforward but it opens a ton of questions about how the schema should be designed and how the data should be stored. I'm hoping some of you might have some advice on what route I should settle on. I'd like to over some ideas and questions they raise. Let's say we're fetching data about hosts like load average, disk free, etc. Ideas I've come up with thus far are: row key: host cf/qualifier: attribute:name value: whatever row timestamp: time of sampling Example: ROW COLUMN+CELL butters column=attribute:load_1, timestamp=1266208871, value=0.3 butters column=attribute:load_1, timestamp=1266208853, value=0.3 butters column=attribute:load_1, timestamp=1266208324, value=0.3 butters column=attribute:load_5, timestamp=1266208871, value=1.3 butters column=attribute:load_5, timestamp=1266208853, value=1.3 butters column=attribute:load_5, timestamp=1266208325, value=1.3 kenny column=attribute:load_1, timestamp=1266208871, value=0.6 kenny column=attribute:load_1, timestamp=1266208853, value=0.6 kenny column=attribute:load_1, timestamp=1266208324, value=0.6 etc. You can easily fetch multiple versions using setTimeRange(), and you can potentially add new data sampled for a host just by adding another qualifier for that row (eg load_5 or disk_free or whatever). The downside, or part I don't really understand here, is trying to figure out how many versions you have (count 'table' doesn't report version counts) and configuring how many versions you store, it appears to only store 3 by default. I don't know how to change that, and I don't know if versions of a row are physically stored next to the previous versions of the row or not. Is this a suboptimal configuration if we expect the data to grow to several terabytes within the year (or say, 3 months)? Another option: row key: host-timestamp cf/qualifier: attribute:name value: whatever row timestamp: time of row insert ROW COLUMN+CELL butters-1266208871 column=attribute:load_1, value=0.3 butters-1266208853 column=attribute:load_1, value=0.3 butters-1266208324 column=attribute:load_1, value=0.3 butters-1266208871 column=attribute:load_5, value=1.3 butters-1266208853 column=attribute:load_5, value=1.3 butters-1266208325 column=attribute:load_5, value=1.3 kenny-1266208871 column=attribute:load_1, value=0.6 kenny-1266208853 column=attribute:load_1, value=0.6 kenny-1266208324 column=attribute:load_1, value=0.6 ... This duplicates the timestamp in the key, which seems wasteful but gives proper counts and you can just key range searches to fetch rows in a range. Is this going to be slower but spread data out further? Is this a good idea? Is filtering on a column family's qualifier efficient (say I want load_1 for the last 2 years for a host, but there are 1000 or so qualifiers per row.. is that going to explode anywhere?)? I've also thought about: row key: host-attribute cf/qualifier: value: value: whatever row timestamp: time of row insert or: row key: host-attribute-timestamp cf/qualifier: value: value: whatever row timestamp: time of row insert But these also seem like suboptimal methods. If anyone has experience or thoughts on this I'd love to hear them! Also, I incorporated the Thrift server changes from http://issues.apache.org/jira/browse/HBASE-1744 into the current Thrift server to get at the time range and version functionality (I added a few features to the API and fixed it so it'll work with Python). If anyone is interested in having that patch for the current thrift server let me know, it's simple and I'll happily share it. I'll likely stuff it up on github soon. Thanks! Eric
-
Re: efficient storage of timeseriesRyan Rawson 2010-02-15, 08:03
In terms of efficiency - there are 2 answers to that:
- compression helps a lot - disk space on DFS is substantially cheaper than on regular RDBM systems As for the schema design, I'd have to generally suggest going with taller tables and scans than going with thick rows - this is because the current way RPCs work we must hold an entire row in memory when streaming it back to the client. If your data eventually grows extremely large, you could run into OOME. Thus taller tables with scans is the way to go. The current shell doesnt have a super elegant way to view versions right now, it's mostly row-centric. To create a table with many row versions it's part of the create options in the shell, do 'help' in there. Here is a brief example: create 'table', {FAMILY=>'famName', VERSIONS=>10000} you can also do an alter, after disabling the table: disable 'table' alter 'table', {FAMILY=>'famName', VERSIONS=>10000} enable 'table' One advantage of not using timestamps as important data is that you get versioning, which can help sometimes. (bugs, etc) Some other people have done tsdb on hbase, I'll let them speak up. On Sun, Feb 14, 2010 at 9:33 PM, Eric Maland <[EMAIL PROTECTED]> wrote: > Hi, > > I'm pretty new to Hbase so bear with me - I am porting over a system for > storage of timeseries data to both HBase and Cassandra. This is pretty > straightforward but it opens a ton of questions about how the schema should > be designed and how the data should be stored. > > I'm hoping some of you might have some advice on what route I should settle > on. I'd like to over some ideas and questions they raise. > > Let's say we're fetching data about hosts like load average, disk free, etc. > Ideas I've come up with thus far are: > > row key: host > cf/qualifier: attribute:name > value: whatever > row timestamp: time of sampling > > Example: > > ROW COLUMN+CELL > > butters column=attribute:load_1, > timestamp=1266208871, value=0.3 > butters column=attribute:load_1, > timestamp=1266208853, value=0.3 > butters column=attribute:load_1, > timestamp=1266208324, value=0.3 > butters column=attribute:load_5, > timestamp=1266208871, value=1.3 > butters column=attribute:load_5, > timestamp=1266208853, value=1.3 > butters column=attribute:load_5, > timestamp=1266208325, value=1.3 > kenny column=attribute:load_1, > timestamp=1266208871, value=0.6 > kenny column=attribute:load_1, > timestamp=1266208853, value=0.6 > kenny column=attribute:load_1, > timestamp=1266208324, value=0.6 > > etc. > > You can easily fetch multiple versions using setTimeRange(), and you > can potentially add new data sampled for a host just by adding another > qualifier for that row (eg load_5 or disk_free or whatever). > > The downside, or part I don't really understand here, is trying to figure > out how many versions you have (count 'table' doesn't report version counts) > and configuring how many versions you store, it appears to only store 3 by > default. I don't know how to change that, and I don't know if versions of a > row are physically stored next to the previous versions of the row or not. > Is this a suboptimal configuration if we expect the data to grow to several > terabytes within the year (or say, 3 months)? > > Another option: > > row key: host-timestamp > cf/qualifier: attribute:name > value: whatever > row timestamp: time of row insert > > ROW COLUMN+CELL > > butters-1266208871 column=attribute:load_1, value=0.3 > > butters-1266208853 column=attribute:load_1, value=0.3 > > butters-1266208324 column=attribute:load_1, value=0.3 > > butters-1266208871 column=attribute:load_5, value=1.3 > > butters-1266208853 column=attribute:load_5, value=1.3 > > butters-1266208325 column=attribute:load_5, value=1.3
-
Re: efficient storage of timeseriesSaptarshi Guha 2010-03-04, 02:39
But if I specify which column families I want, will it still load the
entire row? Suppose row key R has col. families A,B, C, D and A itself has 100 column labels, B has 150 and C 5, D 5 I request row key R, but specify column families C and D. Will it still load A,B? Regards Saptarshi th > taller tables and scans than going with thick rows - this is because > the current way RPCs work we must hold an entire row in memory when > streaming it back to the client. If your data eventually grows > extremely large, you could run into OOME. Thus taller tables with
-
Re: efficient storage of timeseriesStack 2010-03-04, 06:15
On Wed, Mar 3, 2010 at 6:39 PM, Saptarshi Guha <[EMAIL PROTECTED]> wrote:
> But if I specify which column families I want, will it still load the > entire row? No. It will just load the content of the specified column families. Suppose row key R has col. families A,B, C, D and A itself > has 100 column labels, B has 150 and C 5, D 5 > I request row key R, but specify column families C and D. > Will it still load A,B? No. Just C and D. St.Ack > Regards > Saptarshi > > > th >> taller tables and scans than going with thick rows - this is because >> the current way RPCs work we must hold an entire row in memory when >> streaming it back to the client. If your data eventually grows >> extremely large, you could run into OOME. Thus taller tables with > |