|
Rita
2011-08-25, 14:03
Rita
2011-08-25, 14:53
Ian Varley
2011-08-25, 15:03
Rita
2011-08-25, 15:12
Jimson K. James
2011-08-26, 03:34
Sonal Goyal
2011-08-26, 05:08
Sheng Chen
2011-08-26, 06:08
Jimson K. James
2011-08-26, 06:51
Sonal Goyal
2011-08-26, 06:58
Jimson K. James
2011-08-26, 07:17
Jimson K. James
2011-08-26, 07:26
Buttler, David
2011-08-26, 16:08
lars hofhansl
2011-08-26, 18:50
Doug Meil
2011-08-26, 19:09
Sheng Chen
2011-08-29, 02:45
|
-
schema helpRita 2011-08-25, 14:03
Hello,

I am trying to solve a time-related problem. I could certainly use OpenTSDB for this, but I was wondering whether anyone has a clever way to design this type of schema.

I have an inventory table with these fields:

time (unix epoch), fieldA, fieldB, data

There are about 30 million of these entries.

95% of my queries will look like this:
show me where fieldA=zCORE from range [1314180693 to now]

fieldA has up to 4000 unique values.
fieldB has 2 unique values (bool).

So I was thinking of creating 4000*2 tables and placing the data accordingly, so that I can scan easily.

Any thoughts about this? Will HBase freak out if I have 8000 tables?

--
--- Get your facts first, then you can distort them as you please.--
-
Re: schema helpRita 2011-08-25, 14:53
Thanks for your response.

30 million rows is the best case :-)

A couple of questions about using [fieldA][time] as my key:
Would I have to insert in order?
If not, how would HBase know where to stop instead of scanning the entire table?
What would a query actually look like if my key was [fieldA][time]?

As a matter of fact, I can make 100% of my queries fit this form; I will leave the other 5% out of my project/schema.

On Thu, Aug 25, 2011 at 10:13 AM, Ian Varley <[EMAIL PROTECTED]> wrote:
> Rita,
>
> There's no need to create separate tables here--the table is really just a
> "namespace" for keys. A better option would probably be having one table
> with "[fieldA][time]" (the two fields concatenated) as your row key. Then,
> you can seek directly to the start of your records in constant time, and
> then scan forward until you get to the end of the data (linear time in the
> size of data you expect to get back).
>
> The downside of this is that for the 5% of your queries that aren't in this
> form, you may have to do a full table scan. (Alternately, you could also
> maintain secondary indexes that help you get the data back with less than a
> full table scan; that would depend on the nature of the queries).
>
> In general, a good rule of thumb when designing a schema in HBase is, think
> first about how you'd ideally like to access the data. Then structure the
> data to match that access pattern. (This is obviously not ideal if you have
> lots of different access patterns, but then, that's what relational
> databases are for. Most commercial relational DBs wouldn't blink at doing
> analytical queries against 30 million rows.)
>
> Ian
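For anyone following along, here is a minimal sketch of how a "[fieldA][time]" row key could be encoded so that plain byte ordering matches (fieldA, time) ordering. The fixed width of 16 bytes, the zero-byte padding and the name `make_row_key` are assumptions made for illustration; none of them come from the thread or from HBase itself.

```python
import struct

FIELD_A_WIDTH = 16  # hypothetical fixed width for fieldA, not from the thread


def make_row_key(field_a: str, epoch_seconds: int) -> bytes:
    """Build a [fieldA][time] key whose byte order matches (fieldA, time) order."""
    raw = field_a.encode("utf-8")
    if len(raw) > FIELD_A_WIDTH:
        raise ValueError("fieldA is longer than the assumed fixed width")
    # Pad fieldA so the timestamp always starts at the same offset.
    padded = raw.ljust(FIELD_A_WIDTH, b"\x00")
    # Big-endian unsigned 64-bit: lexicographic byte order equals numeric order.
    return padded + struct.pack(">Q", epoch_seconds)


keys = [
    make_row_key("zCORE", 1314180700),
    make_row_key("aEDGE", 1314180999),
    make_row_key("zCORE", 1314180693),
]
ordered = sorted(keys)  # all aEDGE rows first, then zCORE rows by ascending time
```

Storing the timestamp as a decimal string would also work as long as every timestamp has the same number of digits; the binary form just avoids that caveat.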
-
Re: schema helpIan Varley 2011-08-25, 15:03
The rows don't need to be inserted in order; they're maintained in key-sorted order on the disk based on the architecture of HBase, which stores data sorted in memory and periodically flushes to immutable files in HDFS (which are later compacted to make read access more efficient). HBase keeps track of which physical files might contain a given key range, and only reads the ones it needs to.
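To make the paragraph above concrete, here is a toy model of the idea: writes land in an in-memory buffer, the buffer is flushed as an immutable sorted file, and a read only touches files whose key range can contain the requested keys. This is a simplified sketch, not HBase's implementation; `ToyStore`, `flush_threshold` and the first/last-key check are all hypothetical stand-ins.

```python
class ToyStore:
    """Toy model of sorted in-memory writes flushed to immutable sorted files."""

    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold  # hypothetical, tiny for the demo
        self.memstore = {}  # recent writes, sorted at flush time
        self.files = []     # each file: tuple of (key, value) pairs sorted by key

    def put(self, key, value):
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        if self.memstore:
            self.files.append(tuple(sorted(self.memstore.items())))
            self.memstore = {}

    def files_overlapping(self, start, stop):
        """Files whose [first_key, last_key] range intersects [start, stop)."""
        return [f for f in self.files if f[0][0] < stop and f[-1][0] >= start]

    def scan(self, start, stop):
        """Merge rows from the overlapping files and the memstore, in key order."""
        merged = {}
        for f in self.files_overlapping(start, stop):  # oldest first, newer wins
            merged.update((k, v) for k, v in f if start <= k < stop)
        merged.update((k, v) for k, v in self.memstore.items() if start <= k < stop)
        return sorted(merged.items())


store = ToyStore()
for key in ["b2", "b1", "b3", "a2", "a1", "a3", "c1"]:  # arrival order is not key order
    store.put(key, "v-" + key)

rows = store.scan("a", "b")  # only the file holding the "a" keys is read
```

In this run the first flush holds b1..b3 and the second holds a1..a3, so a scan over the "a" range skips the first file entirely even though the rows were written out of order.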
To do a query through the Java API, you could create a scanner with a start row that is the concatenation of your value for fieldA and the start time, and an end row that is the same fieldA value concatenated with the current time.
http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html
Ian
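The start-row/end-row idea can be tried out without a cluster. The sketch below simulates it over a sorted list of keys: seek to the start row, then read forward until the stop row. It is plain Python rather than the HBase client API, the key encoding and `range_scan` are assumptions for illustration, and the stop row is treated as exclusive, which is how the HBase `Scan` stop row behaves.

```python
import bisect
import struct


def make_row_key(field_a: str, epoch_seconds: int) -> bytes:
    # Hypothetical encoding: fieldA padded to 16 bytes, then a big-endian timestamp.
    return field_a.encode("utf-8").ljust(16, b"\x00") + struct.pack(">Q", epoch_seconds)


def range_scan(sorted_keys, start_row, stop_row):
    """Return keys in [start_row, stop_row): seek to the start, then read forward."""
    lo = bisect.bisect_left(sorted_keys, start_row)  # the "seek"
    hi = bisect.bisect_left(sorted_keys, stop_row)   # first key at or past the stop row
    return sorted_keys[lo:hi]


table = sorted(
    make_row_key(field_a, ts)
    for field_a, ts in [
        ("aEDGE", 1314180800),
        ("zCORE", 1314180000),  # before the requested start time
        ("zCORE", 1314180693),
        ("zCORE", 1314190000),
        ("zDIST", 1314185000),  # different fieldA, sorts after every zCORE key
    ]
)

now = 1314200000  # stand-in for the current time
# Stop row is exclusive, so use now + 1 to include a row written exactly at "now".
hits = range_scan(table, make_row_key("zCORE", 1314180693), make_row_key("zCORE", now + 1))
timestamps = [struct.unpack(">Q", key[16:])[0] for key in hits]
```

Rows for other fieldA values and rows older than the start time never show up in `hits`, which is why the scan does not need to touch the whole table.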
-
Re: schema helpRita 2011-08-25, 15:12
That's very good to know.
Can't I do the scan through the HBase shell?
-
RE: schema helpJimson K. James 2011-08-26, 03:34
Hi Ian,
Can you point me to a reference on the key-sorted architecture in HBase? There doesn't seem to be much documentation out there.
***** Confidentiality Statement/Disclaimer ***** This message and any attachments is intended for the sole use of the intended recipient. It may contain confidential information. Any unauthorized use, dissemination or modification is strictly prohibited. If you are not the intended recipient, please notify the sender immediately then delete it from all your systems, and do not copy, use or print. Internet communications are not secure and it is the responsibility of the recipient to make sure that it is virus/malicious code exempt.
The company/sender cannot be responsible for any unauthorized alterations or modifications made to the contents. If you require any form of confirmation of the contents, please contact the company/sender. The company/sender is not liable for any errors or omissions in the content of this message.
-
Re: schema helpSonal Goyal 2011-08-26, 05:08
Hi Jimson,
Here are a few links that talk about the sorted architecture:
http://wiki.apache.org/hadoop/Hbase/DataModel
http://jimbojw.com/wiki/index.php?title=Understanding_Hbase_and_BigTable
I think the original BigTable paper ought to have some details too; I am sorry, I haven't read it recently enough to quote it with authority.
Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>
-
Re: schema helpSheng Chen 2011-08-26, 06:08
If the rows are added with random keys and flushed periodically, is it possible that every HFile holds almost the whole key range? Would that affect random read performance before the compaction is done?
Thanks.
Sean
-
RE: schema helpJimson K. James 2011-08-26, 06:51
Hi Sonal,
Nice references, thank you :)
What I'm currently after is the data distribution in HBase. Is there any tool for measuring hit ratio in HBase?
I'm looking for a way to get the hit ratio per region. Is that possible?
Thanks,
-
Re: schema helpSonal Goyal 2011-08-26, 06:58
Hi Jimson,
Are you talking about hbase.regionserver.blockCacheHitRatio?
http://hbase.apache.org/book/rs_metrics.html
Best Regards,
Sonal
-
RE: schema helpJimson K. James 2011-08-26, 07:17
Hi Sonal,
Not really a cache hit ratio. I'll explain. Let's assume we have 3 regions distributed over 3 region servers. If we read a key/value, can we say that region server 1, being the owner of that key/value, got a hit? If we then read 10 more keys, and the first 5 of those hit region server 2, being the owner of those keys, then the hit count of that region is 5 while the hit count of region 1 is still 1.

-----Original Message-----
From: Sonal Goyal [mailto:[EMAIL PROTECTED]]
Sent: Friday, August 26, 2011 12:28 PM
To: [EMAIL PROTECTED]
Subject: Re: schema help

Hi Jimson,

Are you talking about hbase.regionserver.blockCacheHitRatio?
http://hbase.apache.org/book/rs_metrics.html

Best Regards,
Sonal
Crux: Reporting for HBase <https://github.com/sonalgoyal/crux>
Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal>
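The per-region hit count described in this message can be sketched roughly as follows. This is only a toy model, assuming each region is identified by its start key; `RegionHitCounter`, `recordRead`, and `hitsFor` are hypothetical names for illustration and are not HBase metrics or APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy per-region read counter; class and method names are hypothetical, not HBase APIs.
class RegionHitCounter {
    // Maps each region's start key to a made-up region name.
    private final TreeMap<String, String> regionByStartKey = new TreeMap<>();
    private final Map<String, Integer> hits = new HashMap<>();

    // startKeys must include "" so that every row key falls into some region.
    // Regions are named region-1, region-2, ... in the order the start keys are given.
    RegionHitCounter(String... startKeys) {
        int n = 1;
        for (String startKey : startKeys) {
            regionByStartKey.put(startKey, "region-" + n++);
        }
    }

    // Attribute one read to the region that owns rowKey
    // (the region with the greatest start key <= rowKey).
    void recordRead(String rowKey) {
        String region = regionByStartKey.floorEntry(rowKey).getValue();
        hits.merge(region, 1, Integer::sum);
    }

    int hitsFor(String region) {
        return hits.getOrDefault(region, 0);
    }
}
```

With three regions starting at "", "h", and "p", one read of "apple" followed by five reads of keys starting with "k" would leave region-1 at 1 hit and region-2 at 5, matching the example in the message.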
-
RE: schema helpJimson K. James 2011-08-26, 07:26
Something close, hbase.regionserver.requests
-
RE: schema helpButtler, David 2011-08-26, 16:08
No. The gist of how hbase works is, even on random writes, each key always goes to the one region that contains that key range. If a region gets too large, it automatically splits. Keys that are close together always end up in the same region (or an adjacent region).
Dave

-----Original Message-----
From: Sheng Chen [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 25, 2011 11:09 PM
To: [EMAIL PROTECTED]
Subject: Re: schema help

If the rows are added with random keys and flushed periodically, is it possible that every hfile holds almost the whole key range?
Will it affect the random read performance, before the compaction is done?

Thanks.

Sean
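The key-range routing Dave describes can be illustrated with a small sorted map. This is a minimal sketch, assuming regions are identified by their start key and that a split simply introduces a new start key; `RegionRouter`, `split`, and `regionFor` are hypothetical names and not part of the HBase client API.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of key-range routing; not the HBase client API.
class RegionRouter {
    // Maps each region's start key to a (hypothetical) region name.
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    RegionRouter() {
        // The first region starts at the empty key and covers the whole key space.
        regionsByStartKey.put("", "region-0");
    }

    // Model a split: a new region now starts at splitKey.
    void split(String splitKey, String newRegionName) {
        regionsByStartKey.put(splitKey, newRegionName);
    }

    // A key belongs to the region with the greatest start key <= the key.
    String regionFor(String rowKey) {
        Map.Entry<String, String> owner = regionsByStartKey.floorEntry(rowKey);
        return owner.getValue();
    }

    public static void main(String[] args) {
        RegionRouter router = new RegionRouter();
        router.split("m", "region-1");
        System.out.println(router.regionFor("aCORE|1314180693")); // prints region-0
        System.out.println(router.regionFor("zCORE|1314180693")); // prints region-1
    }
}
```

Because the lookup depends only on the key and the sorted start keys, the order in which rows are written does not change which region they land in, and keys sharing a prefix such as "zCORE|" stay next to each other.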
-
Re: schema helplars hofhansl 2011-08-26, 18:50
In a nutshell, a change to HBase is performed like this:
1. The WAL entry is written and sync'ed to disk.
2. The memstore is updated (that's just a cache in memory).
3. When the memstore reaches a certain size, it is flushed to create a new file.
4. When a certain number of files is reached, they are compacted (combined into fewer files).

When you do a read, HBase scans the memstore and all relevant store files, merging them much as a mergesort does.

-- Lars

________________________________
From: Sheng Chen <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Thursday, August 25, 2011 11:08 PM
Subject: Re: schema help

If the rows are added with random keys and flushed periodically, is it possible that every hfile holds almost the whole key range?
Will it affect the random read performance, before the compaction is done?

Thanks.

Sean
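The four steps Lars lists, plus the merged read, can be modelled in a few lines. This is only a sketch under stated assumptions: `ToyStore` and its methods are hypothetical names, thresholds are counted in entries rather than bytes, the "WAL" is an in-memory list, and the read builds a merged map instead of streaming a merge as a real store would.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy model of the write/read path; none of these names are HBase classes.
class ToyStore {
    private final List<String> wal = new ArrayList<>();                      // step 1: append-only log
    private TreeMap<String, String> memstore = new TreeMap<>();              // step 2: sorted in-memory buffer
    private final List<TreeMap<String, String>> files = new ArrayList<>();   // flushed files, oldest first
    private final int flushSize;   // hypothetical threshold, in entries
    private final int compactAt;   // hypothetical threshold, in files

    ToyStore(int flushSize, int compactAt) {
        this.flushSize = flushSize;
        this.compactAt = compactAt;
    }

    void put(String key, String value) {
        wal.add(key + "=" + value);          // 1. log the change first
        memstore.put(key, value);            // 2. update the memstore
        if (memstore.size() >= flushSize) {
            flush();                         // 3. flush when the memstore is "full"
        }
    }

    private void flush() {
        files.add(memstore);                 // the flushed map is never modified again
        memstore = new TreeMap<>();
        if (files.size() >= compactAt) {
            compact();                       // 4. too many files: combine them
        }
    }

    private void compact() {
        TreeMap<String, String> merged = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            merged.putAll(file);             // oldest first, so newer values win
        }
        files.clear();
        files.add(merged);
    }

    // Read: combine every file and the memstore into one sorted view of [startRow, stopRow).
    SortedMap<String, String> scan(String startRow, String stopRow) {
        TreeMap<String, String> result = new TreeMap<>();
        for (TreeMap<String, String> file : files) {
            result.putAll(file.subMap(startRow, stopRow));
        }
        result.putAll(memstore.subMap(startRow, stopRow)); // memstore holds the newest values
        return result;
    }

    int fileCount() {
        return files.size();
    }
}
```

Note that in this model every flushed file can span the whole key range when keys arrive in random order, which is the situation Sean asks about; the read still returns correct, sorted results, but it has to consult every file until compaction reduces their number.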
-
Re: schema helpDoug Meil 2011-08-26, 19:09
+1 on everything said so far...

Sean, you might also want to check this:
http://hbase.apache.org/book.html#architecture
-
Re: schema helpSheng Chen 2011-08-29, 02:45
Thanks all.
The HFiles and key range I meant are all within one region. If compactions are not done in time, it is possible to have many HFiles in a region, each holding most of the region's key range. When reading, HBase will have to read many HFiles that may hold the key until it finds the right one.

Will the bloom filter solve this problem? Or do I always need to compact when a region is holding hundreds of HFiles?

Regards,
Sean
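As background to the bloom filter question, the general idea is that each file carries a small bit set that can say "this key is definitely not here", so a point read can skip that file. The following is a minimal sketch of that idea only; `TinyBloom` is a hypothetical class with two ad hoc hash functions, and it is not how HBase implements or sizes its filters.

```java
import java.util.BitSet;

// Toy Bloom filter for illustration; not HBase's implementation.
class TinyBloom {
    private final BitSet bits;
    private final int size;

    TinyBloom(int size) {
        this.size = size;
        this.bits = new BitSet(size);
    }

    // Two simple hash functions derived from String.hashCode(); chosen only for brevity.
    private int h1(String key) {
        return Math.floorMod(key.hashCode(), size);
    }

    private int h2(String key) {
        return Math.floorMod(key.hashCode() * 31 + 17, size);
    }

    // Called once for every key written into the file this filter belongs to.
    void add(String key) {
        bits.set(h1(key));
        bits.set(h2(key));
    }

    // false: the key is definitely not in the file, so the file can be skipped.
    // true: the key may be in the file (false positives are possible).
    boolean mightContain(String key) {
        return bits.get(h1(key)) && bits.get(h2(key));
    }
}
```

A filter like this answers membership for one exact key, so it can reduce the number of files touched by a single-row read, but by itself it says nothing about which files overlap a key range in a scan.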