Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Plain View
HBase >> mail # user >> HBase Region/Table Hotspotting


+
Joarder KAMAL 2013-02-11, 02:17
+
lars hofhansl 2013-02-11, 05:14
+
Joarder KAMAL 2013-02-11, 05:56
+
Ted Yu 2013-02-11, 14:55
+
Kevin Odell 2013-02-11, 02:32
+
Joarder KAMAL 2013-02-11, 03:43
Copy link to this message
-
Re: HBase Region/Table Hotspotting
Matt Corgan summarized the pro and con of having large number of regions
here:

https://issues.apache.org/jira/browse/HBASE-7667?focusedCommentId=13575024&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13575024

Cheers

On Sun, Feb 10, 2013 at 7:43 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote:

> Hi Kevin,
>
> Thanks a lot for your great answers.
> Regarding Q5. To clarify,
>
> lets say Facebook is using HBase for the integrated messaging/chat/email
> system in a very large-scale setup. And schema design of such system can
> change over the years (even over the months). Workload patterns may also
> change due to different usage characteristics (like the rate of messaging
> may be higher during a protest/specific event in a particular country). So,
> region/table hotspots have been created at random region servers within the
> cluster despite careful schema design and pre-planning.
>
> The facebook team rush to split the hotspotted regions manually and
> redistribute them over a new set of physical machines which are recently
> added to the system to increase scalability in the face of high user
> demand. Now hotspotted region data could be transferred into new physical
> machines gradually to handle the situations. Now if the shard (region) size
> is small enough then data transfer cost over the network could be minimum
> otherwise large volume of data needs to be transferred instantly.
>
> I have found in many places it is discouraged to have a large number of
> regions systems. However, would it be possible to have very large number of
> regions in a system thus minimizing data transfer cost in case hotspotting
> due to workload/design characteristics. Is there any drawbacks or known
> side-effects? I am rethinking other possibilities other pre-planned schema
> and row-key designs.
>
>
> Thanks again.
>
> Regards,
> Joarder Kamal
>
>
> On 11 February 2013 13:32, Kevin O'dell <[EMAIL PROTECTED]> wrote:
>
> > Hi Joarder,
> >
> >   Welcome to the HBase world.  Let me take some time to address your
> > questions the best I can:
> >
> >  1. How often you are facing Region or Table Hotspotting in HBase
> >    production systems? <--- Hotspotting is not something that just
> happens.
> >  This is usually caused by bad key design and writing to one region more
> > than the others.  I would recommend watching some of Lar's YouTube videos
> > on Schema Design in HBase.
> >    2. If a hotspot is created, how quickly it is automatically cleared
> out
> >    (assuming sudden workload change)? <--- It will not be automatically
> > "cleared out" I think you may be misinformed here.  Basically, it is on
> you
> > to watch you table and your write distribution and determine that you
> have
> > a hotspot and take the necessary action.  Usually the only action is to
> > split the region.  If hotspots become a habitual problem you would most
> > likely want to go back and re-evaluate your current key.
> >    3. How often this kind of situation happens - A hotspot is detected
> and
> >    vanished out before taking an action? or hotspots stays longer period
> of
> >    time? <--- Please see above
> >    4. Or if the hotspot is stays, how it is handled (in general) in
> >    production system? <--- Some people have to hotspot on purpose early
> on,
> > because they only write to a subset of regions.  You will have to
> manually
> > watch for hotspots(which is much easier in later releases).
> >    5. How large data transfer cost is minimized or avoid for re-sharding
> >    regions within a cluster in a single data center or within WAN? <---
> Not
> > quite sure what you are saying here, so I will take a best guess at it.
> >  Sharding is handled in HBase by region splitting.  The best way to
> success
> > in HBase is to understand your data and you needs BEFORE you create you
> > table and start writing into HBase.  This way you can presplit your table
> > to handle the incoming data and you won't have to do a massive amounts of
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB