Matt Corgan summarized the pro and con of having large number of regions
On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote:
> Hi Kevin,
> Thanks a lot for your great answers.
> Regarding Q5. To clarify,
> lets say Facebook is using HBase for the integrated messaging/chat/email
> system in a very large-scale setup. And schema design of such system can
> change over the years (even over the months). Workload patterns may also
> change due to different usage characteristics (like the rate of messaging
> may be higher during a protest/specific event in a particular country). So,
> region/table hotspots have been created at random region servers within the
> cluster despite careful schema design and pre-planning.
> The facebook team rush to split the hotspotted regions manually and
> redistribute them over a new set of physical machines which are recently
> added to the system to increase scalability in the face of high user
> demand. Now hotspotted region data could be transferred into new physical
> machines gradually to handle the situations. Now if the shard (region) size
> is small enough then data transfer cost over the network could be minimum
> otherwise large volume of data needs to be transferred instantly.
> I have found in many places it is discouraged to have a large number of
> regions systems. However, would it be possible to have very large number of
> regions in a system thus minimizing data transfer cost in case hotspotting
> due to workload/design characteristics. Is there any drawbacks or known
> side-effects? I am rethinking other possibilities other pre-planned schema
> and row-key designs.
> Thanks again.
> Joarder Kamal
> On 11 February 2013 13:32, Kevin O'dell <[EMAIL PROTECTED]> wrote:
> > Hi Joarder,
> > Welcome to the HBase world. Let me take some time to address your
> > questions the best I can:
> > 1. How often you are facing Region or Table Hotspotting in HBase
> > production systems? <--- Hotspotting is not something that just
> > This is usually caused by bad key design and writing to one region more
> > than the others. I would recommend watching some of Lar's YouTube videos
> > on Schema Design in HBase.
> > 2. If a hotspot is created, how quickly it is automatically cleared
> > (assuming sudden workload change)? <--- It will not be automatically
> > "cleared out" I think you may be misinformed here. Basically, it is on
> > to watch you table and your write distribution and determine that you
> > a hotspot and take the necessary action. Usually the only action is to
> > split the region. If hotspots become a habitual problem you would most
> > likely want to go back and re-evaluate your current key.
> > 3. How often this kind of situation happens - A hotspot is detected
> > vanished out before taking an action? or hotspots stays longer period
> > time? <--- Please see above
> > 4. Or if the hotspot is stays, how it is handled (in general) in
> > production system? <--- Some people have to hotspot on purpose early
> > because they only write to a subset of regions. You will have to
> > watch for hotspots(which is much easier in later releases).
> > 5. How large data transfer cost is minimized or avoid for re-sharding
> > regions within a cluster in a single data center or within WAN? <---
> > quite sure what you are saying here, so I will take a best guess at it.
> > Sharding is handled in HBase by region splitting. The best way to
> > in HBase is to understand your data and you needs BEFORE you create you
> > table and start writing into HBase. This way you can presplit your table
> > to handle the incoming data and you won't have to do a massive amounts of