Scan performance on a big table as combination of multiple logic tables
Pan, Thomas 2012-02-15, 21:57
Since HBase is tailored to handle one table very well, we are thinking of putting multiple tables into one big table, each on a different set of column families. Our use case is a full table scan with single column value filters. Since records from different "logical tables" live in different column families, could we speed up the scan by first checking the column family referenced by these single column value filters, before going through all the underlying K-V pairs? It would be great if the HBase code already works that way.

$0.02,
Thomas

Re: Scan performance on a big table as combination of multiple logic tables
Todd Lipcon 2012-02-15, 22:02
Hi Thomas,

The issue with combining multiple tables into different CFs of one table is that the tables will get tied together for flush/compact operations. If the workload between them differs significantly you might introduce bad inefficiency for one or the other. See HBASE-3149.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera

Re: Scan performance on a big table as combination of multiple logic tables
Stack 2012-02-15, 22:07
On Wed, Feb 15, 2012 at 2:02 PM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> The issue with combining multiple tables into different CFs of one
> table is that the tables will get tied together for flush/compact
> operations. If the workload between them differs significantly you
> might introduce bad inefficiency for one or the other. See HBASE-3149.

Are the two column families bulk loaded at the same time, Thomas?

Updates come in as trickles over the API but main loading is via bulk load (across the multiple column families?)?

St.Ack

Re: Scan performance on a big table as combination of multiple logic tables
Pan, Thomas 2012-02-17, 21:26
Currently, bulk load is only used for bootstrapping the table(s); after that, random writes are the way to go, so we can assume that operations are evenly distributed over time across all the column families.

-Thomas

Re: Scan performance on a big table as combination of multiple logic tables
Pan, Thomas 2012-02-17, 18:49
In our case, we have similar update patterns for completed and live items.

$0.02,
-Thomas

RE: Scan performance on a big table as combination of multiple logic tables
Vladimir Rodionov 2012-02-15, 22:26
I think having a unique row prefix for every table is the standard way of storing multiple virtual tables inside one BigTable table. You get data locality for every virtual table, and in this case you can easily specify start and stop rows for a Scan.

Assigning a separate CF to each virtual table is a bad idea because data from different virtual tables will be mixed together, since the CF comes after the row key in the default BigTable (HBase) comparison routine.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [EMAIL PROTECTED]

Confidentiality Notice: The information contained in this message, including any attachments hereto, may be confidential and is intended to be read only by the individual or entity to whom this message is addressed. If the reader of this message is not the intended recipient or an agent or designee of the intended recipient, please note that any review, use, disclosure or distribution of this message or its attachments, in any form, is strictly prohibited. If you have received this message in error, please immediately notify the sender and/or [EMAIL PROTECTED] and delete or destroy any copy of this message and its attachments.

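The row-prefix scheme suggested above can be sketched roughly as follows. This is an illustration in plain Python rather than HBase client code: the key layout `<virtual table>|<key>`, the helper names `row_key` and `prefix_scan_range`, and the sample keys are all assumptions made for the example. It relies only on the fact that row keys are ordered lexicographically by bytes, so every row of one virtual table sits in a single contiguous range bounded by a start row (inclusive) and a stop row (exclusive).

```python
from typing import Tuple


def row_key(virtual_table: str, key: str) -> bytes:
    # Hypothetical layout: "<virtual table>|<original key>".
    return f"{virtual_table}|{key}".encode("utf-8")


def prefix_scan_range(prefix: bytes) -> Tuple[bytes, bytes]:
    """Return (start_row, stop_row) covering all keys that begin with prefix.

    The stop row is exclusive. An empty stop row stands for
    "scan to the end of the table".
    """
    if not prefix:
        return b"", b""
    # Trailing 0xFF bytes cannot be incremented, so drop them first.
    trimmed = prefix.rstrip(b"\xff")
    if not trimmed:
        return prefix, b""
    # Increment the last remaining byte to get the smallest key
    # that sorts after every key sharing the prefix.
    stop = trimmed[:-1] + bytes([trimmed[-1] + 1])
    return prefix, stop


# Sorted byte order stands in for the on-disk ordering of row keys.
keys = sorted([
    row_key("live", "item42"),
    row_key("completed", "item7"),
    row_key("live", "item1"),
    row_key("completed", "item99"),
])

start, stop = prefix_scan_range(b"live|")
selected = [k for k in keys if start <= k < stop]
print(selected)
```

With this layout a scan over one virtual table never touches the rows of another, which is the locality benefit described in the message.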
Re: Scan performance on a big table as combination of multiple logic tables
Jacques 2012-02-15, 23:45
Out of curiosity, what do you perceive as the benefit of having only one table? Are there reasons that you think one table would perform better than a few?

If you're splitting data within a table because you'd otherwise have millions of tables, I understand that and would concur with Vladimir's approach. However, if you're really looking at 10 tables versus one table, it seems like HBase is built exactly to make that work well (rather than having to write all sorts of application-level code to do what HBase already does).

thanks,
Jacques

RE: Scan performance on a big table as combination of multiple logic tables
Vladimir Rodionov 2012-02-16, 00:11
10 tables are fine. 1000 are not, especially when one does table pre-splitting to increase write perf.

Too many regions kill HBase.

Best regards,
Vladimir Rodionov
Principal Platform Engineer
Carrier IQ, www.carrieriq.com
e-mail: [EMAIL PROTECTED]

Re: Scan performance on a big table as combination of multiple logic tables
Andrew Purtell 2012-02-16, 01:43
> Too many regions kill HBase.

How many regions do you carry per RS? What was the effective limit you encountered? Curious.

The available public information is getting old now, but BigTable deployments at Google limited the number of tablets per tablet server to ~100. This was for a number of reasons related to their specific hardware configuration, no doubt: considerations such as having enough RAM to keep in-memory tables in memory, the fact that they had something like 160 or 320 GB of local storage only, and so on; but also presumably to limit the scope of failure of a given server, and to keep overheads down.

I advise our ops people to set notifications for when the number of regions per HBase RegionServer gets above 500. The more regions per server, the more must be relocated per server failure, and the longer some regions will be in transition. When we get close to the limit, it's time to add another RegionServer. (Even if HBase could handle 10,000 regions per RegionServer, that wouldn't be a good idea without a distributed master of some kind.) If you are scaling out for this reason already, then the region carrying capacity of the cluster is also scaling.

We have many thousands of regions and region housekeeping overhead is not an issue, although we are certainly not the largest deployment. Currently the META region isn't split; I think that might impose an effective upper bound at some point, but that can be fixed. There's no architectural limit that I am aware of.

Best regards,

- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

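The operational rule of thumb in the message above (alert when a RegionServer carries more than 500 regions, and add servers as the cluster approaches that) can be written down as a small check. This is only a sketch: the function names and the sample region counts are made up for illustration, and the 500 figure is the one quoted in the message, not an HBase limit.

```python
import math

# Alert threshold quoted in the message above; not a hard HBase limit.
REGION_ALERT_THRESHOLD = 500


def servers_over_threshold(regions_per_server, threshold=REGION_ALERT_THRESHOLD):
    """Return the names of servers carrying more regions than the threshold."""
    return sorted(
        name for name, count in regions_per_server.items() if count > threshold
    )


def min_servers_needed(total_regions, threshold=REGION_ALERT_THRESHOLD):
    """Smallest number of servers that keeps every server at or below the threshold."""
    return math.ceil(total_regions / threshold)


# Hypothetical region counts per RegionServer.
cluster = {"rs1": 480, "rs2": 512, "rs3": 500}
print(servers_over_threshold(cluster))
print(min_servers_needed(sum(cluster.values())))
```

The second function assumes regions can be balanced evenly across servers, which a later message in the thread notes is not always the case at the table level.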
Re: Scan performance on a big table as combination of multiple logic tables
Pan, Thomas 2012-02-17, 18:55
Vladimir and Jacques, thanks for the information! Unless HBase handles multiple big tables (with relatively high region counts) well in one cluster, it seems to me that one big table is the way to go. Otherwise, runtime tuning seems to add quite amount of operational cost.

That leads to another question: do we see big region size as an issue? If so, what is the tipping point beyond which scan performance starts to degrade exponentially as region size grows?

Re: Scan performance on a big table as combination of multiple logic tables
Jacques 2012-02-17, 22:46
You should be fine having multiple tables with high region counts. I would avoid making thousands of tables. However, if you have three separate business needs, make three different tables.

You seem to be starting from the perspective that there would be some kind of issue with multiple tables. Why do you think this exists? You said "Otherwise, runtime tuning seems to add quite amount of operational cost." I'm not sure what you are thinking here and where your thoughts are coming from. Additionally, if you have separate tables, then you can tune them differently (e.g. setting them to different region sizes if it makes sense; for example, some of our tables have smaller region sizes so we'll have more maps rather than fewer when we run MapReduce jobs).

Regarding region size: the HFile v1 format in 0.90 and below suffered from taking a long time to transition as individual regions got too big. With 0.92 and HFile v2 that isn't as much of a problem, as I understand it. If I recall correctly, there are numerous organizations using 10 GB regions with success (among others, I believe this is what Yahoo reported they were using for their web crawl tables on their thousand-node cluster). While I haven't run any stats, I believe that there is negligible scan performance impact as region size grows. There is definitely no exponential negative performance impact.

Re: Scan performance on a big table as combination of multiple logic tables
Pan, Thomas 2012-02-18, 07:25
Jacques, thanks for the details on region size. We've observed that the number of regions per region server can be badly skewed at the table level. We do have a tool to balance regions. Still, it is sort of annoying to maintain the balance.

$0.02,
-Thomas

Re: Scan performance on a big table as combination of multiple logic tables
M. C. Srivas 2012-02-19, 16:38
What is the impact when a compaction happens on a large 20G region? Given that the FS will do writes at 30 MB/s (over a single 1 GigE link), it will take about 1500 seconds to read/write the region. Is the region out of service for 25 mins (= 1500 seconds)?

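The back-of-the-envelope figure in the question above can be reproduced with a few lines. This is a sketch under stated assumptions: the region is read once and written once as two sequential passes, both at the quoted 30 MB/s, and 1 GB is taken as 1024 MB. The function name is made up for the example.

```python
def rewrite_seconds(region_gb, throughput_mb_per_s, passes=2):
    """Rough time to stream a region through the file system.

    passes=2 models one full read plus one full write, done back to back
    at the same throughput (an assumption; real compactions overlap I/O).
    """
    region_mb = region_gb * 1024
    return passes * region_mb / throughput_mb_per_s


seconds = rewrite_seconds(region_gb=20, throughput_mb_per_s=30)
print(f"{seconds:.0f} s, about {seconds / 60:.0f} min")
```

Under these assumptions the estimate comes to about 1365 seconds, roughly 23 minutes, which is in the same range as the rounded 1500 seconds quoted in the message. It says nothing about whether the region is unavailable during that time.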
Re: Scan performance on a big table as combination of multiple logic tablesMikael Sitruk 2012-02-19, 21:45
During compaction the region is not out of service.
According to the documentation, the max region size for the V2 format is 20G.
And now the question: assuming that 20G is the limit and that the number of regions on a single RS should stay low (< 500), does it mean there is no point in having an RS with more than 10TB of storage for HBase to use? (Otherwise locality will not be achieved for some servers. I also assume that compression is used and that it therefore compensates for the additional space needed for replication.)
If the max number of regions per RS is smaller, then the storage size is even smaller. Is that correct?

Mikael.S

On Sun, Feb 19, 2012 at 6:38 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
> What is the impact when a compaction happens on a large 20G region? Given
> that the FS will do writes at 30 MB/s (over a single 1 GigE link), it will
> take about 1500 seconds to read/write the region. Is the region out of
> service for 25 mins (= 1500 seconds)?
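The estimate in the message above is simple multiplication and can be checked with a short Python sketch. The function name is hypothetical; the 20 GB and 500-region figures are the ones quoted in the thread, and 1 TB is taken as 1000 GB. Like the message, the sketch assumes compression roughly cancels out the space needed for replication, so it only counts logical data.

```python
def max_storage_per_rs_tb(region_size_gb: float, regions_per_rs: int) -> float:
    """Upper bound on the HBase data one region server hosts, in TB.

    Illustrative only: 1 TB = 1000 GB, replication and compression ignored.
    """
    return region_size_gb * regions_per_rs / 1000.0


# Figures from the message: 20 GB regions, at most 500 regions per RS.
print(max_storage_per_rs_tb(20, 500))  # 10.0

# A lower region count per RS shrinks the bound proportionally.
print(max_storage_per_rs_tb(20, 200))  # 4.0
```

Both inputs are treated as soft planning numbers here, not as limits enforced by HBase.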
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-21, 20:08
On Sun, Feb 19, 2012 at 1:45 PM, Mikael Sitruk <[EMAIL PROTECTED]> wrote:
> During compaction the region is not out of service.
> According to documentation the max region size for V2 format is 20G
> And now the question: Assuming that 20G is the limit and the number of
> regions in a single RS should stay low < 500 it means that there is no mean
> having RS with more than 10TB of storage to use by HBase (otherwise
> locality will not be achieve for some servers, i also assume that
> compression is used and therefore it compensate the need for additional
> space for replication)?
> If the max number of region per RS is smaller then the storage size is even
> smaller. Is it correct?

In the documentation 20GB is given as an example of a larger size that can be supported, but nothing blocks you from going way higher than that. I've done some import tests and had 100GB regions. It just takes a while to compact the bigger files.

Also you can go over 500 regions, in fact one of our clusters has 14,398 regions right now. It's just a pain to reassign everything when HBase boots but this is an offline cluster.

J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesMikael Sitruk 2012-02-21, 21:17
This is interesting, J.D. So, is there a limitation on the region size or not? Can it really be any number? If so, besides the collection time, is there any impact (perhaps the documentation should be updated too)?
Regarding the number of regions you have (14,398): is that for a single RS? What is your number of RS?

Mikael.S

On Feb 21, 2012 10:09 PM, "Jean-Daniel Cryans" <[EMAIL PROTECTED]> wrote:
> In the documentation 20GB is given as an example of a larger size that
> can be supported, but nothing blocks you from going way higher than
> that. I've done some import tests and had 100GB regions. It just takes
> a while to compact the bigger files.
>
> Also you can go over 500 regions, in fact one of our clusters has
> 14,398 regions right now. It's just a pain to reassign everything when
> HBase boots but this is an offline cluster.
>
> J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-21, 21:40
On Tue, Feb 21, 2012 at 1:17 PM, Mikael Sitruk <[EMAIL PROTECTED]> wrote:
> This is interesting J.D. so, is there a limitation on the region size or
> not?

Your imagination? Like I said nothing blocks you in the code.

> Can it be really any number?

That's what it implies.

> If so beside the collection time is there
> any impact (perhaps the documentation should be updated too)?

Collection time? You mean GC? Sorry I don't get what you mean.

> Regarding the number of regions you have (14,398) is it for a single RS?
> What is your number of RS?

Currently 91 in that cluster. It varies :)

We have >200 tables coming all in different sizes.

J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesMikael Sitruk 2012-02-21, 21:57
See inline

On Feb 21, 2012 11:40 PM, "Jean-Daniel Cryans" <[EMAIL PROTECTED]> wrote:
> > If so beside the collection time is there
> > any impact (perhaps the documentation should be updated too)?
>
> Collection time? You mean GC? Sorry I don't get what you mean.

*Sorry, typo (from mobile): I meant compaction, not collection.

> > Regarding the number of regions you have (14,398) is it for a single RS?
> > What is your number of RS?
>
> Currently 91 in that cluster. It varies :)
>
> We have >200 tables coming all in different sizes.

*Not clear: 91 RS, and 14,398 regions in total? Or per RS?

Mikael.S
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-21, 22:13
On Tue, Feb 21, 2012 at 1:57 PM, Mikael Sitruk <[EMAIL PROTECTED]> wrote:
>> > If so beside the collection time is there
>> > any impact (perhaps the documentation should be updated too)?
>>
>> Collection time? You mean GC? Sorry I don't get what you mean.
>>
>
> *Sorry, typo mistake (from mobile) I meant compaction not collection

Ah! Well there's a ton of impacts starting from having less regions :) But definitely compactions will take a lot longer the bigger the regions are since more and more is done in a single process. The documentation could definitely have more info on that.

>> > Regarding the number of regions you have (14,398) is it for a single RS?
>> > What is your number of RS?
>>
>> Currently 91 in that cluster. It varies :)
>>
>> We have >200 tables coming all in different sizes.
>
> *Not clear, 91 rs, and 14398 regions in total? Or per RS?

Oh sorry, total. 14k on a single RS is impossible/suicide if you have any data in there because it would OOME trying to load the indexes (better in 0.92 tho).

J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesMikael Sitruk 2012-02-21, 22:30
Ok, so this is approx 150 regions per RS.
What is the math relating memory (index size) to the number of regions?
(Btw, at the beginning when I mentioned 500 regions, it was per RS.)
I'm trying to figure out what my cluster configuration should be, regarding regions, region size, memory size, and number of RS, for the volume and workload I'm using.

On Feb 22, 2012 12:14 AM, "Jean-Daniel Cryans" <[EMAIL PROTECTED]> wrote:
> Oh sorry, total. 14k on a single RS is impossible/suicide if you have
> any data in there because it would OOME trying to load the indexes
> (better in 0.92 tho).
>
> J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-21, 23:31
This describes how they are written; with your knowledge of your data size and average key size you can do the math:

http://hbase.apache.org/book.html#d0e9542

J-D

On Tue, Feb 21, 2012 at 2:30 PM, Mikael Sitruk <[EMAIL PROTECTED]> wrote:
> Ok, so this is approx 150 regions per RS
> What are the maths between the memory (index size) and number of regions?
> (Btw at the beginning when I mentionned 500 regions it was per RS.)
> I'm trying to figure out what should be my cluster configuration, regarding
> region, region size, memory size, and number of RS for the volume and
> workload I'm using
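One way to "do the math" suggested above is sketched below in Python. It is a rough model, not a description of the actual HFile layout: it assumes a flat block index with one entry per data block held in memory, a 64 KB block size, and 12 bytes of offset/size bookkeeping per entry. The function name, the 10 GiB region size and the 50-byte average key are hypothetical; only the figure of about 150 regions per RS comes from the thread.

```python
def block_index_bytes(data_bytes, avg_key_bytes,
                      block_bytes=64 * 1024, entry_overhead_bytes=12):
    """Rough size of a flat block index: one entry per data block.

    Assumptions (not from the thread): 64 KB blocks and 12 bytes of
    offset/size bookkeeping per entry; JVM object overhead is ignored.
    """
    blocks = -(-data_bytes // block_bytes)  # ceiling division
    return blocks * (avg_key_bytes + entry_overhead_bytes)


# Hypothetical region server: 150 regions of 10 GiB each, 50-byte keys.
data = 150 * 10 * 2**30
print(f"{block_index_bytes(data, 50) / 2**30:.2f} GiB")  # 1.42 GiB
```

Under these assumptions the index grows linearly with the data hosted and with the average key size, which is why a very high region count on one server can exhaust the heap.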
-
Re: Scan performance on a big table as combination of multiple logic tablesStack 2012-02-22, 01:33
On Tue, Feb 21, 2012 at 1:17 PM, Mikael Sitruk <[EMAIL PROTECTED]> wrote:
> This is interesting J.D. so, is there a limitation on the region size or
> not? Can it be really any number? If so beside the collection time is there
> any impact (perhaps the documentation should be updated too)?

Yes. It should not be read as a hard limit. If that is what it says, we need a patch for the doc.

St.Ack
-
Re: Scan performance on a big table as combination of multiple logic tablesM. C. Srivas 2012-02-22, 01:44
On Tue, Feb 21, 2012 at 12:08 PM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote:
> In the documentation 20GB is given as an example of a larger size that
> can be supported, but nothing blocks you from going way higher than
> that. I've done some import tests and had 100GB regions. It just takes
> a while to compact the bigger files.

With no impact on Java GC going nuts? FB reported (a few months ago) that it was bad to run a region-server with -Xmx larger than 15G or 16G. Unless that's no longer true, wouldn't that be a limiting factor for how large one should make regions?

> Also you can go over 500 regions, in fact one of our clusters has
> 14,398 regions right now. It's just a pain to reassign everything when
> HBase boots but this is an offline cluster.
>
> J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-22, 01:56
>> In the documentation 20GB is given as an example of a larger size that
>> can be supported, but nothing blocks you from going way higher than
>> that. I've done some import tests and had 100GB regions. It just takes
>> a while to compact the bigger files.
>
> With no impact on Java GC going nuts? FB reported (a few months ago) it
> was bad to run a region-server
> with -Xmx larger than 15G or 16G. Unless its no longer true, wouldn't that
> be limiting factor for how
> large one should make regions?

You'll have to explain how having "big regions" means you GC a lot, I don't see the relation.

J-D
-
Re: Scan performance on a big table as combination of multiple logic tablesStack 2012-02-22, 02:16
On Tue, Feb 21, 2012 at 5:44 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
> With no impact on Java GC going nuts? FB reported (a few months ago) it
> was bad to run a region-server
> with -Xmx larger than 15G or 16G. Unless its no longer true, wouldn't that
> be limiting factor for how
> large one should make regions?

We don't bring the total region into memory Srivas (Is that what you are thinking?).

The FB recommendation of > 15G heaps was probably the old adage around big heaps taking a long time to sweep when GCing?

Good on you,
St.Ack
-
Re: Scan performance on a big table as combination of multiple logic tablesM. C. Srivas 2012-02-22, 05:29
On Tue, Feb 21, 2012 at 6:16 PM, Stack <[EMAIL PROTECTED]> wrote:
> We don't bring the total region into memory Srivas (Is that what you
> are thinking?).

Yes, that was my thinking --- to do a major compaction the region-server would have to load all the flushed files for that region, merge them, and then write out the new region. If the region-file was 20g in size, the region-server would require well over 20g of heap space to do this work. Am I completely off?

> The FB recommendation of > 15G heaps was probably the old adage around
> big heaps taking a long time to sweep when GCing?
>
> Good on you,
> St.Ack
-
Re: Scan performance on a big table as combination of multiple logic tablesStack 2012-02-22, 05:58
On Tue, Feb 21, 2012 at 9:29 PM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
> Yes, that was my thinking --- to do a major compaction the region-server
> would have to load all the flushed files for that region, merge them, and
> then write out the new region. If the region-file was 20g in size, the
> region-server would require well over 20g of heap space to do this work. Am
> I completely off?

You are a little off. We open all hfiles and then stream through each of them, doing a merge sort and streaming the output to the new compacted file.

Here is where we open a scanner on all the files to compact and then, as we inch through, we figure out what to write to the output:

http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/Store.html#1393

(It's a bit hard to follow what's going on -- file selection is already done higher up in the call chain.)

St.Ack
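The streaming merge described in the message above can be illustrated with a minimal Python sketch. This is not HBase code: the function name `compact` and the sample data are made up, and it ignores versions, deletes and everything else a real compaction decides while it "inches through". It only shows why memory use does not grow with file size, since `heapq.merge` pulls one item at a time from each sorted input.

```python
import heapq


def compact(sorted_files):
    """Lazily merge sorted (key, value) iterables into one sorted stream.

    Only one pending item per input is held at a time, so the inputs
    never need to fit in memory together.
    """
    yield from heapq.merge(*sorted_files, key=lambda kv: kv[0])


# Two hypothetical "store files", each already sorted by row key.
hfile_a = [("row1", "a"), ("row4", "d")]
hfile_b = [("row2", "b"), ("row3", "c")]

for key, value in compact([iter(hfile_a), iter(hfile_b)]):
    print(key, value)
```

In a real system the merged stream would be written straight to the new file as it is produced instead of being printed.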
-
Re: Scan performance on a big table as combination of multiple logic tablesM. C. Srivas 2012-02-24, 06:34
On Tue, Feb 21, 2012 at 9:58 PM, Stack <[EMAIL PROTECTED]> wrote:
> You are a little off. We open all hfiles and then stream through each
> of them doing a merge sort streaming the outputting to the new
> compacted file.

Doh! Seems obvious once you mention it. Sorry about that.

> Here is where we open a scanner on all the files to compact and then
> as we inch through, we figure what to write to the output:
>
> http://hbase.apache.org/xref/org/apache/hadoop/hbase/regionserver/Store.html#1393
>
> (Its a bit hard to follow whats going on -- file selection is done
> already higher up in call chain).
>
> St.Ack
-
Re: Scan performance on a big table as combination of multiple logic tablesJean-Daniel Cryans 2012-02-21, 20:05
On Sun, Feb 19, 2012 at 8:38 AM, M. C. Srivas <[EMAIL PROTECTED]> wrote:
> What is the impact when a compaction happens on a large 20G region? Given
> that the FS will do writes at 30 MB/s (over a single 1 GigE link), it will
> take about 1500 seconds to read/write the region. Is the region out of
> service for 25 mins (= 1500 seconds)?

It would be awful if it did :) And fortunately it does not.

J-D
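The 25-minute figure in the quoted question can be roughly reproduced with a small Python sketch. How the original estimate was derived is not stated; assuming one sequential pass to read the region and one to write it, each at 30 MB/s, gives a number of the same order. The function name and the two-pass assumption are illustrative only.

```python
def rewrite_seconds(region_gb, throughput_mb_per_s, passes=2):
    """Time to stream a region through the file system `passes` times.

    Assumes 1 GB = 1024 MB and a constant throughput; the two passes
    (one read, one write) are an assumption, not a measured value.
    """
    return region_gb * 1024 * passes / throughput_mb_per_s


seconds = rewrite_seconds(20, 30)
print(round(seconds))           # 1365
print(round(seconds / 60, 1))   # 22.8
```

As the reply says, the region stays in service during this time, so the result is an estimate of background I/O duration and not of downtime.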
-
Re: Scan performance on a big table as combination of multiple logic tablesPan, Thomas 2012-02-24, 18:44
Just a quick heads-up. Ted pointed me to this jira:
https://issues.apache.org/jira/browse/HBASE-5416
Max (the author) has confirmed that the patch provides what I want. :-)

On 2/15/12 1:57 PM, "Pan, Thomas" <[EMAIL PROTECTED]> wrote:
>
>Since Hbase is tailored to handle one table very well, we are thinking to
>put multiple tables into one big table but on different column family
>sets. Our use case is full table scan against single column value
>filters. As records from different "logical tables" are at different
>column families, could we speed up the scan performance by simply
>checking the column family referenced by these single column value
>filters first before really going through all the underlying K-V pairs?
>It would be great if the Hbase code is already coded that way.
>
>
>$0.02,
>Thomas
-
Re: Scan performance on a big table as combination of multiple logic tablesStack 2012-02-24, 18:54
On Fri, Feb 24, 2012 at 10:44 AM, Pan, Thomas <[EMAIL PROTECTED]> wrote:
>
> Just a quick heads-up. Ted pointed me to this jira:
> https://issues.apache.org/jira/browse/HBASE-5416
> Max (the author) has confirmed that the patch provides what I want. :-)
>

What do you think about what Mikhael says on the end? Have you tried doing two scans; one for the work to do and then another to do the work?

St.Ack
-
Re: Scan performance on a big table as combination of multiple logic tablesPan, Thomas 2012-02-25, 00:20
He has a good point on unit test coverage. Atomicity is not a concern for the use case mentioned in this email thread. :-) The two-scan approach doesn't seem to help, as the second scan still goes through all the rows, if my understanding is correct.

-Thomas

On 2/24/12 10:54 AM, "Stack" <[EMAIL PROTECTED]> wrote:
>What do you think about what Mikhael says on the end? Have you tried
>doing two scans; one for the work to do and then another to do the
>work?
>
>St.Ack
|