|
Joarder KAMAL
2013-02-11, 02:17
Kevin O'dell
2013-02-11, 02:32
Joarder KAMAL
2013-02-11, 03:43
Ted Yu
2013-02-11, 04:03
lars hofhansl
2013-02-11, 05:14
Joarder KAMAL
2013-02-11, 05:56
Ted Yu
2013-02-11, 14:55
|
-
HBase Region/Table HotspottingJoarder KAMAL 2013-02-11, 02:17
This is my first email in the group. I am having a more general and
open-ended question but hope to get some reasoning from the HBase user communities. I am a very basic HBase user and still learning. My intention to use HBase in one of our research project. Recently I was looking through Lars George's book "HBase - The Definitive Guide" and two particular topics caught my eyes. One is 'Region and Table Hotspotting' and the other is 'Region Auto-Sharding and Merging'. *Scenario: * If a hotspot is created in a particular region or in a table (having multiple regions) due to sudden workload change, then one may split the region into further small pieces and distributed it to a number of available physical machine in the cluster. This process should require large data transfer between different machines in the cluster and incur a performance cost. One may also change the 'key' definition and manage the regions. But I am not sure how effective or logical to change key designs on a production system. *Questions:* 1. How often you are facing Region or Table Hotspotting in HBase production systems? 2. If a hotspot is created, how quickly it is automatically cleared out (assuming sudden workload change)? 3. How often this kind of situation happens - A hotspot is detected and vanished out before taking an action? or hotspots stays longer period of time? 4. Or if the hotspot is stays, how it is handled (in general) in production system? 5. How large data transfer cost is minimized or avoid for re-sharding regions within a cluster in a single data center or within WAN? 6. Is hotspoting in HBase cluster is really a issue (big!) nowadays for OLAP workloads and real-time analytics? Further directions to more information about region/table hotspotting is most welcome. Many thanks in advance. Regards, Joarder Kamal
-
Re: HBase Region/Table HotspottingKevin O'dell 2013-02-11, 02:32
Hi Joarder,
Welcome to the HBase world. Let me take some time to address your questions the best I can: 1. How often you are facing Region or Table Hotspotting in HBase production systems? <--- Hotspotting is not something that just happens. This is usually caused by bad key design and writing to one region more than the others. I would recommend watching some of Lar's YouTube videos on Schema Design in HBase. 2. If a hotspot is created, how quickly it is automatically cleared out (assuming sudden workload change)? <--- It will not be automatically "cleared out" I think you may be misinformed here. Basically, it is on you to watch you table and your write distribution and determine that you have a hotspot and take the necessary action. Usually the only action is to split the region. If hotspots become a habitual problem you would most likely want to go back and re-evaluate your current key. 3. How often this kind of situation happens - A hotspot is detected and vanished out before taking an action? or hotspots stays longer period of time? <--- Please see above 4. Or if the hotspot is stays, how it is handled (in general) in production system? <--- Some people have to hotspot on purpose early on, because they only write to a subset of regions. You will have to manually watch for hotspots(which is much easier in later releases). 5. How large data transfer cost is minimized or avoid for re-sharding regions within a cluster in a single data center or within WAN? <--- Not quite sure what you are saying here, so I will take a best guess at it. Sharding is handled in HBase by region splitting. The best way to success in HBase is to understand your data and you needs BEFORE you create you table and start writing into HBase. This way you can presplit your table to handle the incoming data and you won't have to do a massive amounts of splits. Later you can allow HBase to split your tables manually, or you can set the maxfile size high and manually control the splits or sharding. 6. Is hotspoting in HBase cluster is really a issue (big!) nowadays for OLAP workloads and real-time analytics? <--- Just design your schema correctly and this should not be a problem for you. Please let me know if this answers your questions. On Sun, Feb 10, 2013 at 9:17 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote: > This is my first email in the group. I am having a more general and > open-ended question but hope to get some reasoning from the HBase user > communities. > I am a very basic HBase user and still learning. My intention to use HBase > in one of our research project. Recently I was looking through Lars > George's book "HBase - The Definitive Guide" and two particular topics > caught my eyes. One is 'Region and Table Hotspotting' and the other is > 'Region Auto-Sharding and Merging'. > > *Scenario: * > If a hotspot is created in a particular region or in a table (having > multiple regions) due to sudden workload change, then one may split the > region into further small pieces and distributed it to a number of > available physical machine in the cluster. This process should require > large data transfer between different machines in the cluster and incur a > performance cost. One may also change the 'key' definition and manage the > regions. But I am not sure how effective or logical to change key designs > on a production system. > > *Questions:* > > 1. How often you are facing Region or Table Hotspotting in HBase > production systems? > 2. If a hotspot is created, how quickly it is automatically cleared out > (assuming sudden workload change)? > 3. How often this kind of situation happens - A hotspot is detected and > vanished out before taking an action? or hotspots stays longer period of > time? > 4. Or if the hotspot is stays, how it is handled (in general) in > production system? > 5. How large data transfer cost is minimized or avoid for re-sharding > regions within a cluster in a single data center or within WAN? Kevin O'Dell Customer Operations Engineer, Cloudera
-
Re: HBase Region/Table HotspottingJoarder KAMAL 2013-02-11, 03:43
Hi Kevin,
Thanks a lot for your great answers. Regarding Q5. To clarify, lets say Facebook is using HBase for the integrated messaging/chat/email system in a very large-scale setup. And schema design of such system can change over the years (even over the months). Workload patterns may also change due to different usage characteristics (like the rate of messaging may be higher during a protest/specific event in a particular country). So, region/table hotspots have been created at random region servers within the cluster despite careful schema design and pre-planning. The facebook team rush to split the hotspotted regions manually and redistribute them over a new set of physical machines which are recently added to the system to increase scalability in the face of high user demand. Now hotspotted region data could be transferred into new physical machines gradually to handle the situations. Now if the shard (region) size is small enough then data transfer cost over the network could be minimum otherwise large volume of data needs to be transferred instantly. I have found in many places it is discouraged to have a large number of regions systems. However, would it be possible to have very large number of regions in a system thus minimizing data transfer cost in case hotspotting due to workload/design characteristics. Is there any drawbacks or known side-effects? I am rethinking other possibilities other pre-planned schema and row-key designs. Thanks again. Regards, Joarder Kamal On 11 February 2013 13:32, Kevin O'dell <[EMAIL PROTECTED]> wrote: > Hi Joarder, > > Welcome to the HBase world. Let me take some time to address your > questions the best I can: > > 1. How often you are facing Region or Table Hotspotting in HBase > production systems? <--- Hotspotting is not something that just happens. > This is usually caused by bad key design and writing to one region more > than the others. I would recommend watching some of Lar's YouTube videos > on Schema Design in HBase. > 2. If a hotspot is created, how quickly it is automatically cleared out > (assuming sudden workload change)? <--- It will not be automatically > "cleared out" I think you may be misinformed here. Basically, it is on you > to watch you table and your write distribution and determine that you have > a hotspot and take the necessary action. Usually the only action is to > split the region. If hotspots become a habitual problem you would most > likely want to go back and re-evaluate your current key. > 3. How often this kind of situation happens - A hotspot is detected and > vanished out before taking an action? or hotspots stays longer period of > time? <--- Please see above > 4. Or if the hotspot is stays, how it is handled (in general) in > production system? <--- Some people have to hotspot on purpose early on, > because they only write to a subset of regions. You will have to manually > watch for hotspots(which is much easier in later releases). > 5. How large data transfer cost is minimized or avoid for re-sharding > regions within a cluster in a single data center or within WAN? <--- Not > quite sure what you are saying here, so I will take a best guess at it. > Sharding is handled in HBase by region splitting. The best way to success > in HBase is to understand your data and you needs BEFORE you create you > table and start writing into HBase. This way you can presplit your table > to handle the incoming data and you won't have to do a massive amounts of > splits. Later you can allow HBase to split your tables manually, or you > can set the maxfile size high and manually control the splits or sharding. > 6. Is hotspoting in HBase cluster is really a issue (big!) nowadays for > OLAP workloads and real-time analytics? <--- Just design your schema > correctly and this should not be a problem for you. > > Please let me know if this answers your questions. > > On Sun, Feb 10, 2013 at 9:17 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote:
-
Re: HBase Region/Table HotspottingTed Yu 2013-02-11, 04:03
Matt Corgan summarized the pro and con of having large number of regions
here: https://issues.apache.org/jira/browse/HBASE-7667?focusedCommentId=13575024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13575024 Cheers On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote: > Hi Kevin, > > Thanks a lot for your great answers. > Regarding Q5. To clarify, > > lets say Facebook is using HBase for the integrated messaging/chat/email > system in a very large-scale setup. And schema design of such system can > change over the years (even over the months). Workload patterns may also > change due to different usage characteristics (like the rate of messaging > may be higher during a protest/specific event in a particular country). So, > region/table hotspots have been created at random region servers within the > cluster despite careful schema design and pre-planning. > > The facebook team rush to split the hotspotted regions manually and > redistribute them over a new set of physical machines which are recently > added to the system to increase scalability in the face of high user > demand. Now hotspotted region data could be transferred into new physical > machines gradually to handle the situations. Now if the shard (region) size > is small enough then data transfer cost over the network could be minimum > otherwise large volume of data needs to be transferred instantly. > > I have found in many places it is discouraged to have a large number of > regions systems. However, would it be possible to have very large number of > regions in a system thus minimizing data transfer cost in case hotspotting > due to workload/design characteristics. Is there any drawbacks or known > side-effects? I am rethinking other possibilities other pre-planned schema > and row-key designs. > > > Thanks again. > > Regards, > Joarder Kamal > > > On 11 February 2013 13:32, Kevin O'dell <[EMAIL PROTECTED]> wrote: > > > Hi Joarder, > > > > Welcome to the HBase world. Let me take some time to address your > > questions the best I can: > > > > 1. How often you are facing Region or Table Hotspotting in HBase > > production systems? <--- Hotspotting is not something that just > happens. > > This is usually caused by bad key design and writing to one region more > > than the others. I would recommend watching some of Lar's YouTube videos > > on Schema Design in HBase. > > 2. If a hotspot is created, how quickly it is automatically cleared > out > > (assuming sudden workload change)? <--- It will not be automatically > > "cleared out" I think you may be misinformed here. Basically, it is on > you > > to watch you table and your write distribution and determine that you > have > > a hotspot and take the necessary action. Usually the only action is to > > split the region. If hotspots become a habitual problem you would most > > likely want to go back and re-evaluate your current key. > > 3. How often this kind of situation happens - A hotspot is detected > and > > vanished out before taking an action? or hotspots stays longer period > of > > time? <--- Please see above > > 4. Or if the hotspot is stays, how it is handled (in general) in > > production system? <--- Some people have to hotspot on purpose early > on, > > because they only write to a subset of regions. You will have to > manually > > watch for hotspots(which is much easier in later releases). > > 5. How large data transfer cost is minimized or avoid for re-sharding > > regions within a cluster in a single data center or within WAN? <--- > Not > > quite sure what you are saying here, so I will take a best guess at it. > > Sharding is handled in HBase by region splitting. The best way to > success > > in HBase is to understand your data and you needs BEFORE you create you > > table and start writing into HBase. This way you can presplit your table > > to handle the incoming data and you won't have to do a massive amounts of
-
Re: HBase Region/Table Hotspottinglars hofhansl 2013-02-11, 05:14
The most common cause for hotspotting is inserting rows with monotonically increasing row keys.
In that case only the last region will get the writes and no amount of splitting will fix that (only one region serer will hold the last region of the table regardless of how small it is). There are ways around this. If you generate keys make sure they are not monotonically increasing. For example if you do not care about the sort order of the keys w.r.t. to each other you could reverse the bytes before you use them as row key. Another option is to prefix the key with a hash of the key (but then you loose the ability to do range scan across keys). If you still need to scan rows according to their sort order you can "salt" (as some call it) the key by prefix it with a limited number of random single digit (maybe 5-10 different numbers). Could also do a mod of the key. Each scan then has to issue multiple scans in parallel for each of the possible prefix numbers. (In fact that is a pretty effective way to avoid hotspotting and to parallelize your scans, but it needs some client side to reconcile the parallel scans). Another reason for hotspotting is inserting new versions a of small'ish set of row keys. In that case splitting might help, because it will increase the likelyhood of all those key falling into the same region. -- Lars ________________________________ From: Joarder KAMAL <[EMAIL PROTECTED]> To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Sunday, February 10, 2013 6:17 PM Subject: HBase Region/Table Hotspotting This is my first email in the group. I am having a more general and open-ended question but hope to get some reasoning from the HBase user communities. I am a very basic HBase user and still learning. My intention to use HBase in one of our research project. Recently I was looking through Lars George's book "HBase - The Definitive Guide" and two particular topics caught my eyes. One is 'Region and Table Hotspotting' and the other is 'Region Auto-Sharding and Merging'. *Scenario: * If a hotspot is created in a particular region or in a table (having multiple regions) due to sudden workload change, then one may split the region into further small pieces and distributed it to a number of available physical machine in the cluster. This process should require large data transfer between different machines in the cluster and incur a performance cost. One may also change the 'key' definition and manage the regions. But I am not sure how effective or logical to change key designs on a production system. *Questions:* 1. How often you are facing Region or Table Hotspotting in HBase production systems? 2. If a hotspot is created, how quickly it is automatically cleared out (assuming sudden workload change)? 3. How often this kind of situation happens - A hotspot is detected and vanished out before taking an action? or hotspots stays longer period of time? 4. Or if the hotspot is stays, how it is handled (in general) in production system? 5. How large data transfer cost is minimized or avoid for re-sharding regions within a cluster in a single data center or within WAN? 6. Is hotspoting in HBase cluster is really a issue (big!) nowadays for OLAP workloads and real-time analytics? Further directions to more information about region/table hotspotting is most welcome. Many thanks in advance. Regards, Joarder Kamal
-
Re: HBase Region/Table HotspottingJoarder KAMAL 2013-02-11, 05:56
Thanks Lars for explaining the reasons for hotspotting and key design
techniques. Just wondering, is it possible to alter key design (e.g. from sequential keys to salt keys) at run time in the production system? What are the impacts? To Ted, Thanks a lot for point out at [HBASE-7667]. Interesting idea indeed. And Matt Corgan explained the trade-offs between having fewer and more regions. He also pointed out how a large number of regions can impact the compaction process. Although I am an expert on HBase system, but what did you think about how to find an optimal value of stripes or sub-region for each region? Actually I didn't get the idea of having a fixed boundary stripes. Thanks again. HBase community is really great !! Regards, Joarder Kamal On 11 February 2013 16:14, lars hofhansl <[EMAIL PROTECTED]> wrote: > The most common cause for hotspotting is inserting rows with monotonically > increasing row keys. > In that case only the last region will get the writes and no amount of > splitting will fix that (only one region serer will hold the last region of > the table regardless of how small it is). > There are ways around this. If you generate keys make sure they are not > monotonically increasing. For example if you do not care about the sort > order of the keys w.r.t. to each other you could reverse the bytes before > you use them as row key. Another option is to prefix the key with a hash of > the key (but then you loose the ability to do range scan across keys). > > If you still need to scan rows according to their sort order you can > "salt" (as some call it) the key by prefix it with a limited number of > random single digit (maybe 5-10 different numbers). Could also do a mod of > the key. Each scan then has to issue multiple scans in parallel for each of > the possible prefix numbers. > (In fact that is a pretty effective way to avoid hotspotting and to > parallelize your scans, but it needs some client side to reconcile the > parallel scans). > > Another reason for hotspotting is inserting new versions a of small'ish > set of row keys. In that case splitting might help, because it will > increase the likelyhood of all those key falling into the same region. > > > -- Lars > > > > ________________________________ > From: Joarder KAMAL <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > Sent: Sunday, February 10, 2013 6:17 PM > Subject: HBase Region/Table Hotspotting > > This is my first email in the group. I am having a more general and > open-ended question but hope to get some reasoning from the HBase user > communities. > I am a very basic HBase user and still learning. My intention to use HBase > in one of our research project. Recently I was looking through Lars > George's book "HBase - The Definitive Guide" and two particular topics > caught my eyes. One is 'Region and Table Hotspotting' and the other is > 'Region Auto-Sharding and Merging'. > > *Scenario: * > If a hotspot is created in a particular region or in a table (having > multiple regions) due to sudden workload change, then one may split the > region into further small pieces and distributed it to a number of > available physical machine in the cluster. This process should require > large data transfer between different machines in the cluster and incur a > performance cost. One may also change the 'key' definition and manage the > regions. But I am not sure how effective or logical to change key designs > on a production system. > > *Questions:* > > 1. How often you are facing Region or Table Hotspotting in HBase > production systems? > 2. If a hotspot is created, how quickly it is automatically cleared out > (assuming sudden workload change)? > 3. How often this kind of situation happens - A hotspot is detected and > vanished out before taking an action? or hotspots stays longer period of > time? > 4. Or if the hotspot is stays, how it is handled (in general) in > production system? > 5. How large data transfer cost is minimized or avoid for re-sharding
-
Re: HBase Region/Table HotspottingTed Yu 2013-02-11, 14:55
Sub-region management is in experimental stage.
We will get better idea when HBASE-7667 gets in-depth review and more cluster-level testing is done. You can watch HBASE-7667 so that you get updates. Cheers On Sun, Feb 10, 2013 at 9:56 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote: > Thanks Lars for explaining the reasons for hotspotting and key design > techniques. > Just wondering, is it possible to alter key design (e.g. from sequential > keys to salt keys) at run time in the production system? What are the > impacts? > > To Ted, > Thanks a lot for point out at [HBASE-7667]. Interesting idea indeed. And > Matt Corgan explained the trade-offs between having fewer and more regions. > He also pointed out how a large number of regions can impact the compaction > process. Although I am an expert on HBase system, but what did you think > about how to find an optimal value of stripes or sub-region for each > region? Actually I didn't get the idea of having a fixed boundary stripes. > > Thanks again. > HBase community is really great !! > > > > Regards, > Joarder Kamal > > > > On 11 February 2013 16:14, lars hofhansl <[EMAIL PROTECTED]> wrote: > > > The most common cause for hotspotting is inserting rows with > monotonically > > increasing row keys. > > In that case only the last region will get the writes and no amount of > > splitting will fix that (only one region serer will hold the last region > of > > the table regardless of how small it is). > > There are ways around this. If you generate keys make sure they are not > > monotonically increasing. For example if you do not care about the sort > > order of the keys w.r.t. to each other you could reverse the bytes before > > you use them as row key. Another option is to prefix the key with a hash > of > > the key (but then you loose the ability to do range scan across keys). > > > > If you still need to scan rows according to their sort order you can > > "salt" (as some call it) the key by prefix it with a limited number of > > random single digit (maybe 5-10 different numbers). Could also do a mod > of > > the key. Each scan then has to issue multiple scans in parallel for each > of > > the possible prefix numbers. > > (In fact that is a pretty effective way to avoid hotspotting and to > > parallelize your scans, but it needs some client side to reconcile the > > parallel scans). > > > > Another reason for hotspotting is inserting new versions a of small'ish > > set of row keys. In that case splitting might help, because it will > > increase the likelyhood of all those key falling into the same region. > > > > > > -- Lars > > > > > > > > ________________________________ > > From: Joarder KAMAL <[EMAIL PROTECTED]> > > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > > Sent: Sunday, February 10, 2013 6:17 PM > > Subject: HBase Region/Table Hotspotting > > > > This is my first email in the group. I am having a more general and > > open-ended question but hope to get some reasoning from the HBase user > > communities. > > I am a very basic HBase user and still learning. My intention to use > HBase > > in one of our research project. Recently I was looking through Lars > > George's book "HBase - The Definitive Guide" and two particular topics > > caught my eyes. One is 'Region and Table Hotspotting' and the other is > > 'Region Auto-Sharding and Merging'. > > > > *Scenario: * > > If a hotspot is created in a particular region or in a table (having > > multiple regions) due to sudden workload change, then one may split the > > region into further small pieces and distributed it to a number of > > available physical machine in the cluster. This process should require > > large data transfer between different machines in the cluster and incur a > > performance cost. One may also change the 'key' definition and manage the > > regions. But I am not sure how effective or logical to change key designs > > on a production system. > > > > *Questions:* > > > > 1. How often you are facing Region or Table Hotspotting in HBase |