Welcome to the HBase world. Let me take some time to address your
questions the best I can:
1. How often you are facing Region or Table Hotspotting in HBase
production systems? <--- Hotspotting is not something that just happens.
This is usually caused by bad key design and writing to one region more
than the others. I would recommend watching some of Lar's YouTube videos
on Schema Design in HBase.
2. If a hotspot is created, how quickly it is automatically cleared out
(assuming sudden workload change)? <--- It will not be automatically
"cleared out" I think you may be misinformed here. Basically, it is on you
to watch you table and your write distribution and determine that you have
a hotspot and take the necessary action. Usually the only action is to
split the region. If hotspots become a habitual problem you would most
likely want to go back and re-evaluate your current key.
3. How often this kind of situation happens - A hotspot is detected and
vanished out before taking an action? or hotspots stays longer period of
time? <--- Please see above
4. Or if the hotspot is stays, how it is handled (in general) in
production system? <--- Some people have to hotspot on purpose early on,
because they only write to a subset of regions. You will have to manually
watch for hotspots(which is much easier in later releases).
5. How large data transfer cost is minimized or avoid for re-sharding
regions within a cluster in a single data center or within WAN? <--- Not
quite sure what you are saying here, so I will take a best guess at it.
Sharding is handled in HBase by region splitting. The best way to success
in HBase is to understand your data and you needs BEFORE you create you
table and start writing into HBase. This way you can presplit your table
to handle the incoming data and you won't have to do a massive amounts of
splits. Later you can allow HBase to split your tables manually, or you
can set the maxfile size high and manually control the splits or sharding.
6. Is hotspoting in HBase cluster is really a issue (big!) nowadays for
OLAP workloads and real-time analytics? <--- Just design your schema
correctly and this should not be a problem for you.
Please let me know if this answers your questions.
On Sun, Feb 10, 2013 at 9:17 PM, Joarder KAMAL <[EMAIL PROTECTED]> wrote:
> This is my first email in the group. I am having a more general and
> open-ended question but hope to get some reasoning from the HBase user
> I am a very basic HBase user and still learning. My intention to use HBase
> in one of our research project. Recently I was looking through Lars
> George's book "HBase - The Definitive Guide" and two particular topics
> caught my eyes. One is 'Region and Table Hotspotting' and the other is
> 'Region Auto-Sharding and Merging'.
> *Scenario: *
> If a hotspot is created in a particular region or in a table (having
> multiple regions) due to sudden workload change, then one may split the
> region into further small pieces and distributed it to a number of
> available physical machine in the cluster. This process should require
> large data transfer between different machines in the cluster and incur a
> performance cost. One may also change the 'key' definition and manage the
> regions. But I am not sure how effective or logical to change key designs
> on a production system.
> 1. How often you are facing Region or Table Hotspotting in HBase
> production systems?
> 2. If a hotspot is created, how quickly it is automatically cleared out
> (assuming sudden workload change)?
> 3. How often this kind of situation happens - A hotspot is detected and
> vanished out before taking an action? or hotspots stays longer period of
> 4. Or if the hotspot is stays, how it is handled (in general) in
> production system?
> 5. How large data transfer cost is minimized or avoid for re-sharding
> regions within a cluster in a single data center or within WAN?
Customer Operations Engineer, Cloudera