|
|
-
Re: HBase Region/Table HotspottingTed Yu 2013-02-11, 04:03
Matt Corgan summarized the pro and con of having large number of regions
here: https://issues.apache.org/jira/browse/HBASE-7667?focusedCommentId=13575024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13575024 Cheers On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote: > Hi Kevin, > > Thanks a lot for your great answers. > Regarding Q5. To clarify, > > lets say Facebook is using HBase for the integrated messaging/chat/email > system in a very large-scale setup. And schema design of such system can > change over the years (even over the months). Workload patterns may also > change due to different usage characteristics (like the rate of messaging > may be higher during a protest/specific event in a particular country). So, > region/table hotspots have been created at random region servers within the > cluster despite careful schema design and pre-planning. > > The facebook team rush to split the hotspotted regions manually and > redistribute them over a new set of physical machines which are recently > added to the system to increase scalability in the face of high user > demand. Now hotspotted region data could be transferred into new physical > machines gradually to handle the situations. Now if the shard (region) size > is small enough then data transfer cost over the network could be minimum > otherwise large volume of data needs to be transferred instantly. > > I have found in many places it is discouraged to have a large number of > regions systems. However, would it be possible to have very large number of > regions in a system thus minimizing data transfer cost in case hotspotting > due to workload/design characteristics. Is there any drawbacks or known > side-effects? I am rethinking other possibilities other pre-planned schema > and row-key designs. > > > Thanks again. > > Regards, > Joarder Kamal > > > On 11 February 2013 13:32, Kevin O'dell <[EMAIL PROTECTED]> wrote: > > > Hi Joarder, > > > > Welcome to the HBase world. Let me take some time to address your > > questions the best I can: > > > > 1. How often you are facing Region or Table Hotspotting in HBase > > production systems? <--- Hotspotting is not something that just > happens. > > This is usually caused by bad key design and writing to one region more > > than the others. I would recommend watching some of Lar's YouTube videos > > on Schema Design in HBase. > > 2. If a hotspot is created, how quickly it is automatically cleared > out > > (assuming sudden workload change)? <--- It will not be automatically > > "cleared out" I think you may be misinformed here. Basically, it is on > you > > to watch you table and your write distribution and determine that you > have > > a hotspot and take the necessary action. Usually the only action is to > > split the region. If hotspots become a habitual problem you would most > > likely want to go back and re-evaluate your current key. > > 3. How often this kind of situation happens - A hotspot is detected > and > > vanished out before taking an action? or hotspots stays longer period > of > > time? <--- Please see above > > 4. Or if the hotspot is stays, how it is handled (in general) in > > production system? <--- Some people have to hotspot on purpose early > on, > > because they only write to a subset of regions. You will have to > manually > > watch for hotspots(which is much easier in later releases). > > 5. How large data transfer cost is minimized or avoid for re-sharding > > regions within a cluster in a single data center or within WAN? <--- > Not > > quite sure what you are saying here, so I will take a best guess at it. > > Sharding is handled in HBase by region splitting. The best way to > success > > in HBase is to understand your data and you needs BEFORE you create you > > table and start writing into HBase. This way you can presplit your table > > to handle the incoming data and you won't have to do a massive amounts of |