|
|
-
How does HBase perform load balancing?
MauMau 2010-05-08, 09:17
Hello, I got the following error when I sent the mail. Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the recipient domain. We recommend contacting the other email provider for further information about the cause of this error. The error that the other server returned was: 552 552 spam score (5.2) exceeded threshold (state 18). The original mail might have been too long, so let me split it and send again. I'm comparing HBase and Cassandra, which I think are the most promising distributed key-value stores, to determine which one to choose for the future OLTP and data analysis. I found the following benchmark report by Yahoo! Research which evalutes HBase, Cassandra, PNUTS, and sharded MySQL. http://wiki.apache.org/hadoop/Hbase/DesignOverviewThe above report refers to HBase 0.20.3. Reading this and HBase's documentation, two questions about load balancing and replication have risen. Could anyone give me any information to help solve these questions? [Q1] Load balancing Does HBase move regions to a newly added region server (logically, not physically on storage) immediately? If not immediately, what timing? On what criteria does the master unassign and assign regions among region servers? CPU load, read/write request rates, or just the number of regions the region servers are handling? According the HBase design overview on the page below, the master monitors the load of each region server and moves regions. http://wiki.apache.org/hadoop/Hbase/DesignOverviewThe related part is the following: ---------------------------------------- HMaster duties: Assigning/unassigning regions to/from HRegionServers (unassigning is for load balance) Monitor the health and load of each HRegionServer ... If HMaster detects overloaded or low loaded H!RegionServer, it will unassign (close) some regions from most loaded H!RegionServer. Unassigned regions will be assigned to low loaded servers. ---------------------------------------- When I read the above, I thought that the master checks the load of region servers periodically (once a few minutes or so) and performs load balancing. And I thought that the master unassigns some regions from the existing loaded region servers to a newly added one immediately when the new server joins the cluster and contacts the master. However, the benchmark report by Yahoo! Research describes as follows. This says that HBase does not move regions until compaction, so I cannot get the effect of adding new servers immediately even if I added the new server to solve the overload problem. What's the fact? ---------------------------------------- 6.7 Elastic Speedup As the figure shows, the read latency spikes initially after the sixth server is added, before the latency stabilizes at a value slightly lower than the latency for five servers. This result indicates that HBase is able to shift read and write load to the new server, resulting in lower latency. HBase does not move existing data to the new server until compactions occur2. The result is less latency variance compared to Cassandra since there is no repartitioning process competing for the disk. However, the new server is underutilized, since existing data is served off the old servers. ... 2 It is possible to run the HDFS load balancer to force data to the new servers, but this greatly disrupts HBase’s ability to serve data partitions from the same servers on which they are stored. ---------------------------------------- MauMau
-
Re: How does HBase perform load balancing?
Ryan Rawson 2010-05-08, 09:30
hey, HBase currently uses region count to load balance. Regions are assigned in a semi-randomish order to other regionservers. The paper is somewhat correct in that we are not moving data around aggressively, because then people would write in complaining we move data around too much :-) So a few notes, HBase is not a key-value store, its a tabluar data store, which maintains key order, and allows the easy construction of left-match key indexes. One other thing... if you are using a DHT (eg: cassandra), when a node fails the load moves to the other servers in the ring-segment. For example if you have N=3 and you lose a node in a segment, the load of a server would move to 2 other servers. Your monitoring system should probably be tied into the DHT topology since if a second node fails in the same ring you probably want to take action. Ironically nodes in cassandra are special (unlike the publicly stated info) and they "belong" to a particular ring segment and cannot be used to store other data. There are tools to do node swap in, but you want your cluster management to be as automated as possible. Compared to a bigtable architecture, the load of a failed regionserver is evenly spread across the entire rest of the cluster. No node has a special role in HDFS and HBase, any data can be hosted and served from any node. As nodes fail, as long as you have enough nodes to serve the load you are in good shape. The HDFS missing block report lets you know when you have lost too many nodes. Nodes have no special role and can host and hold any data. In the future we want to add a load balancing based on requests/second. We have all the requisite data and architecture, but other things are up more important right now. Pure region count load balancing tends to work fairly well in practice. 2010/5/8 MauMau <[EMAIL PROTECTED]>: > Hello, > > I got the following error when I sent the mail. > > Technical details of permanent failure: > Google tried to deliver your message, but it was rejected by the recipient > domain. We recommend contacting the other email provider for further > information about the cause of this error. The error that the other server > returned was: 552 552 spam score (5.2) exceeded threshold (state 18). > > The original mail might have been too long, so let me split it and send > again. > > > I'm comparing HBase and Cassandra, which I think are the most promising > distributed key-value stores, to determine which one to choose for the > future OLTP and data analysis. > I found the following benchmark report by Yahoo! Research which evalutes > HBase, Cassandra, PNUTS, and sharded MySQL. > > http://wiki.apache.org/hadoop/Hbase/DesignOverview> > The above report refers to HBase 0.20.3. > Reading this and HBase's documentation, two questions about load balancing > and replication have risen. Could anyone give me any information to help > solve these questions? > > [Q1] Load balancing > Does HBase move regions to a newly added region server (logically, not > physically on storage) immediately? If not immediately, what timing? > On what criteria does the master unassign and assign regions among region > servers? CPU load, read/write request rates, or just the number of regions > the region servers are handling? > > According the HBase design overview on the page below, the master monitors > the load of each region server and moves regions. > > http://wiki.apache.org/hadoop/Hbase/DesignOverview> > The related part is the following: > > ---------------------------------------- > HMaster duties: > Assigning/unassigning regions to/from HRegionServers (unassigning is for > load balance) > Monitor the health and load of each HRegionServer > ... > If HMaster detects overloaded or low loaded H!RegionServer, it will unassign > (close) some regions from most loaded H!RegionServer. Unassigned regions > will be assigned to low loaded servers. > ---------------------------------------- > > When I read the above, I thought that the master checks the load of region
-
Re: How does HBase perform load balancing?
Kevin Apte 2010-05-08, 10:36
Are these the good links for the Yahoo Benchmarks? http://www.brianfrankcooper.net/pubs/ycsb-v4.pdfhttp://research.*yahoo*.com/files/ycsb.pdfKevin On Sat, May 8, 2010 at 3:00 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: > hey, > > HBase currently uses region count to load balance. Regions are > assigned in a semi-randomish order to other regionservers. > > The paper is somewhat correct in that we are not moving data around > aggressively, because then people would write in complaining we move > data around too much :-) > > So a few notes, HBase is not a key-value store, its a tabluar data > store, which maintains key order, and allows the easy construction of > left-match key indexes. > > One other thing... if you are using a DHT (eg: cassandra), when a node > fails the load moves to the other servers in the ring-segment. For > example if you have N=3 and you lose a node in a segment, the load of > a server would move to 2 other servers. Your monitoring system should > probably be tied into the DHT topology since if a second node fails in > the same ring you probably want to take action. Ironically nodes in > cassandra are special (unlike the publicly stated info) and they > "belong" to a particular ring segment and cannot be used to store > other data. There are tools to do node swap in, but you want your > cluster management to be as automated as possible. > > Compared to a bigtable architecture, the load of a failed regionserver > is evenly spread across the entire rest of the cluster. No node has a > special role in HDFS and HBase, any data can be hosted and served from > any node. As nodes fail, as long as you have enough nodes to serve > the load you are in good shape. The HDFS missing block report lets you > know when you have lost too many nodes. Nodes have no special role and > can host and hold any data. > > In the future we want to add a load balancing based on > requests/second. We have all the requisite data and architecture, but > other things are up more important right now. Pure region count load > balancing tends to work fairly well in practice. > > 2010/5/8 MauMau <[EMAIL PROTECTED]>: > > Hello, > > > > I got the following error when I sent the mail. > > > > Technical details of permanent failure: > > Google tried to deliver your message, but it was rejected by the > recipient > > domain. We recommend contacting the other email provider for further > > information about the cause of this error. The error that the other > server > > returned was: 552 552 spam score (5.2) exceeded threshold (state 18). > > > > The original mail might have been too long, so let me split it and send > > again. > > > > > > I'm comparing HBase and Cassandra, which I think are the most promising > > distributed key-value stores, to determine which one to choose for the > > future OLTP and data analysis. > > I found the following benchmark report by Yahoo! Research which evalutes > > HBase, Cassandra, PNUTS, and sharded MySQL. > > > > http://wiki.apache.org/hadoop/Hbase/DesignOverview> > > > The above report refers to HBase 0.20.3. > > Reading this and HBase's documentation, two questions about load > balancing > > and replication have risen. Could anyone give me any information to help > > solve these questions? > > > > [Q1] Load balancing > > Does HBase move regions to a newly added region server (logically, not > > physically on storage) immediately? If not immediately, what timing? > > On what criteria does the master unassign and assign regions among region > > servers? CPU load, read/write request rates, or just the number of > regions > > the region servers are handling? > > > > According the HBase design overview on the page below, the master > monitors > > the load of each region server and moves regions. > > > > http://wiki.apache.org/hadoop/Hbase/DesignOverview> > > > The related part is the following: > > > > ---------------------------------------- > > HMaster duties: > > Assigning/unassigning regions to/from HRegionServers (unassigning is for
-
Re: How does HBase perform load balancing?
MauMau 2010-05-08, 11:47
Thank you, Ryan
I look forward to the future advancement in load balancing strategy --- distributing "hot" regions among lightly loaded region servers based on request count.
Thanks for the note that HBase is not a key-value store. Yes, some people say like you, and others say that HBase is one of the key-value stores. I think they call HBase call HBase as a key-value store in the sense that users access data by keys (row key, column key).
> the same ring you probably want to take action. Ironically nodes in > cassandra are special (unlike the publicly stated info) and they > "belong" to a particular ring segment and cannot be used to store > other data. There are tools to do node swap in, but you want your
Do you mean "preference list" by "ring segment"? If all nodes in the preference list for particular data get down before another node takes over the data, the data cannot be read. But I thought writes can be accepted by the mechanism called hinted handoff. However, the probability that all nodes in the preference list crash is very low. That is the same for HBase; if all data nodes for particular data of HBase get down, that data cannot be accessed.
The original concern is whether I can answer "yes" in the following situation:
Customer: The system is slow. How can we improve the responsiveness? Me: The response time of the data store IS high, and all region servers are busy. You can improve the situation by adding some region servers. Customer: Just adding new servers is OK? Can we see the improvement soon?
Maumau ----- Original Message ----- From: "Ryan Rawson" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Saturday, May 08, 2010 6:30 PM Subject: Re: How does HBase perform load balancing? hey,
HBase currently uses region count to load balance. Regions are assigned in a semi-randomish order to other regionservers.
The paper is somewhat correct in that we are not moving data around aggressively, because then people would write in complaining we move data around too much :-)
So a few notes, HBase is not a key-value store, its a tabluar data store, which maintains key order, and allows the easy construction of left-match key indexes.
One other thing... if you are using a DHT (eg: cassandra), when a node fails the load moves to the other servers in the ring-segment. For example if you have N=3 and you lose a node in a segment, the load of a server would move to 2 other servers. Your monitoring system should probably be tied into the DHT topology since if a second node fails in the same ring you probably want to take action. Ironically nodes in cassandra are special (unlike the publicly stated info) and they "belong" to a particular ring segment and cannot be used to store other data. There are tools to do node swap in, but you want your cluster management to be as automated as possible.
Compared to a bigtable architecture, the load of a failed regionserver is evenly spread across the entire rest of the cluster. No node has a special role in HDFS and HBase, any data can be hosted and served from any node. As nodes fail, as long as you have enough nodes to serve the load you are in good shape. The HDFS missing block report lets you know when you have lost too many nodes. Nodes have no special role and can host and hold any data.
In the future we want to add a load balancing based on requests/second. We have all the requisite data and architecture, but other things are up more important right now. Pure region count load balancing tends to work fairly well in practice.
-
Re: How does HBase perform load balancing?
Amandeep Khurana 2010-05-08, 21:48
The Yahoo! research link is the most recent one afaik... Thats the one submitted to SOCC'10 On Sat, May 8, 2010 at 3:36 AM, Kevin Apte <[EMAIL PROTECTED] > wrote: > Are these the good links for the Yahoo Benchmarks? > http://www.brianfrankcooper.net/pubs/ycsb-v4.pdf> http://research.*yahoo*.com/files/ycsb.pdf> > Kevin > > > On Sat, May 8, 2010 at 3:00 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote: > > > hey, > > > > HBase currently uses region count to load balance. Regions are > > assigned in a semi-randomish order to other regionservers. > > > > The paper is somewhat correct in that we are not moving data around > > aggressively, because then people would write in complaining we move > > data around too much :-) > > > > So a few notes, HBase is not a key-value store, its a tabluar data > > store, which maintains key order, and allows the easy construction of > > left-match key indexes. > > > > One other thing... if you are using a DHT (eg: cassandra), when a node > > fails the load moves to the other servers in the ring-segment. For > > example if you have N=3 and you lose a node in a segment, the load of > > a server would move to 2 other servers. Your monitoring system should > > probably be tied into the DHT topology since if a second node fails in > > the same ring you probably want to take action. Ironically nodes in > > cassandra are special (unlike the publicly stated info) and they > > "belong" to a particular ring segment and cannot be used to store > > other data. There are tools to do node swap in, but you want your > > cluster management to be as automated as possible. > > > > Compared to a bigtable architecture, the load of a failed regionserver > > is evenly spread across the entire rest of the cluster. No node has a > > special role in HDFS and HBase, any data can be hosted and served from > > any node. As nodes fail, as long as you have enough nodes to serve > > the load you are in good shape. The HDFS missing block report lets you > > know when you have lost too many nodes. Nodes have no special role and > > can host and hold any data. > > > > In the future we want to add a load balancing based on > > requests/second. We have all the requisite data and architecture, but > > other things are up more important right now. Pure region count load > > balancing tends to work fairly well in practice. > > > > 2010/5/8 MauMau <[EMAIL PROTECTED]>: > > > Hello, > > > > > > I got the following error when I sent the mail. > > > > > > Technical details of permanent failure: > > > Google tried to deliver your message, but it was rejected by the > > recipient > > > domain. We recommend contacting the other email provider for further > > > information about the cause of this error. The error that the other > > server > > > returned was: 552 552 spam score (5.2) exceeded threshold (state 18). > > > > > > The original mail might have been too long, so let me split it and send > > > again. > > > > > > > > > I'm comparing HBase and Cassandra, which I think are the most promising > > > distributed key-value stores, to determine which one to choose for the > > > future OLTP and data analysis. > > > I found the following benchmark report by Yahoo! Research which > evalutes > > > HBase, Cassandra, PNUTS, and sharded MySQL. > > > > > > http://wiki.apache.org/hadoop/Hbase/DesignOverview> > > > > > The above report refers to HBase 0.20.3. > > > Reading this and HBase's documentation, two questions about load > > balancing > > > and replication have risen. Could anyone give me any information to > help > > > solve these questions? > > > > > > [Q1] Load balancing > > > Does HBase move regions to a newly added region server (logically, not > > > physically on storage) immediately? If not immediately, what timing? > > > On what criteria does the master unassign and assign regions among > region > > > servers? CPU load, read/write request rates, or just the number of > > regions > > > the region servers are handling?
-
Re: How does HBase perform load balancing?
Ryan Rawson 2010-05-08, 23:08
On Sat, May 8, 2010 at 4:47 AM, MauMau <[EMAIL PROTECTED]> wrote: > Thank you, Ryan > > I look forward to the future advancement in load balancing strategy --- > distributing "hot" regions among lightly loaded region servers based on > request count.
Here at Stumbleupon we handle 12,000 requests/second, some regionservers are a bit warmer than others, but it hasnt proven to be a serious issue.
> > Thanks for the note that HBase is not a key-value store. Yes, some people > say like you, and others say that HBase is one of the key-value stores. I > think they call HBase call HBase as a key-value store in the sense that > users access data by keys (row key, column key).
I think the equivalent would be to use, for example, mysql but never mention or use the join features. I don't want HBase to become typecast as just a key-value store, when the things you can do in HBase are substantially more. > >> the same ring you probably want to take action. Ironically nodes in >> cassandra are special (unlike the publicly stated info) and they >> "belong" to a particular ring segment and cannot be used to store >> other data. There are tools to do node swap in, but you want your > > Do you mean "preference list" by "ring segment"? If all nodes in the > preference list for particular data get down before another node takes over > the data, the data cannot be read. But I thought writes can be accepted by > the mechanism called hinted handoff. > However, the probability that all nodes in the preference list crash is very > low. That is the same for HBase; if all data nodes for particular data of > HBase get down, that data cannot be accessed. > > The original concern is whether I can answer "yes" in the following > situation: > > Customer: The system is slow. How can we improve the responsiveness? > Me: The response time of the data store IS high, and all region servers are > busy. You can improve the situation by adding some region servers. > Customer: Just adding new servers is OK? Can we see the improvement soon?
If you want more head room you add servers. HBase will reassign regions to those regionservers. You now have access to more CPU and RAM and have a larger and more effective block cache. The data doesn't get spread around, but you can initiate major compactions on some/all of the tables which will move data around immediately. There are no concerns for growing a cluster in this way - I have done it to double the size of a cluster and I saw immediate performance. I major compacted a table I was doing a map reduce on and I saw more performance improvements. In a live serving system you do NOT want to be accessing disk most the time - caching is the name of the game for reducing latency. Everyone does this (you think your google results are read from disk?) and it's a fairly uniform "law" of doing low latency services - RAM is king. And when you expand a HBase cluster you get more effective ram immediately - no rebalancing required (unlike DHT-based architectures).
> > Maumau > > > ----- Original Message ----- From: "Ryan Rawson" <[EMAIL PROTECTED]> > To: <[EMAIL PROTECTED]> > Sent: Saturday, May 08, 2010 6:30 PM > Subject: Re: How does HBase perform load balancing? > > > hey, > > HBase currently uses region count to load balance. Regions are > assigned in a semi-randomish order to other regionservers. > > The paper is somewhat correct in that we are not moving data around > aggressively, because then people would write in complaining we move > data around too much :-) > > So a few notes, HBase is not a key-value store, its a tabluar data > store, which maintains key order, and allows the easy construction of > left-match key indexes. > > One other thing... if you are using a DHT (eg: cassandra), when a node > fails the load moves to the other servers in the ring-segment. For > example if you have N=3 and you lose a node in a segment, the load of > a server would move to 2 other servers. Your monitoring system should
-
Re: How does HBase perform load balancing?
MauMau 2010-05-09, 00:50
Hi, Ryan
From: "Ryan Rawson" <[EMAIL PROTECTED]> > Here at Stumbleupon we handle 12,000 requests/second, some > regionservers are a bit warmer than others, but it hasnt proven to be > a serious issue.
Thank you for sharing your precious experience. "12,000 requests/second" is great! > If you want more head room you add servers. HBase will reassign > regions to those regionservers. You now have access to more CPU and > RAM and have a larger and more effective block cache. The data doesn't > get spread around, but you can initiate major compactions on some/all > of the tables which will move data around immediately. There are no > concerns for growing a cluster in this way - I have done it to double > the size of a cluster and I saw immediate performance. I major > compacted a table I was doing a map reduce on and I saw more > performance improvements. In a live serving system you do NOT want to > be accessing disk most the time - caching is the name of the game for > reducing latency. Everyone does this (you think your google results > are read from disk?) and it's a fairly uniform "law" of doing low > latency services - RAM is king. And when you expand a HBase cluster > you get more effective ram immediately - no rebalancing required > (unlike DHT-based architectures).
What I understood from the above is as follows. I'd appreciate if you could point out if I am wrong.
1. I need to perform major-compaction to unassign regions from the existing loaded region servers to a new region server. I cannot reassign the regions just by doing minor compaction and letting the non-loaded new server perform major compaction later. Having the loaded existing server do heavy major compaction is a concern. 2. "no rebalancing required" means that the blocks of HDFS files for regions need not be moved from one datanode to another.
Thank you. Maumau
-
Re: How does HBase perform load balancing?
Ryan Rawson 2010-05-09, 01:08
On Sat, May 8, 2010 at 5:50 PM, MauMau <[EMAIL PROTECTED]> wrote: > Hi, Ryan > > From: "Ryan Rawson" <[EMAIL PROTECTED]> >> >> Here at Stumbleupon we handle 12,000 requests/second, some >> regionservers are a bit warmer than others, but it hasnt proven to be >> a serious issue. > > Thank you for sharing your precious experience. "12,000 requests/second" is > great! > > >> If you want more head room you add servers. HBase will reassign >> regions to those regionservers. You now have access to more CPU and >> RAM and have a larger and more effective block cache. The data doesn't >> get spread around, but you can initiate major compactions on some/all >> of the tables which will move data around immediately. There are no >> concerns for growing a cluster in this way - I have done it to double >> the size of a cluster and I saw immediate performance. I major >> compacted a table I was doing a map reduce on and I saw more >> performance improvements. In a live serving system you do NOT want to >> be accessing disk most the time - caching is the name of the game for >> reducing latency. Everyone does this (you think your google results >> are read from disk?) and it's a fairly uniform "law" of doing low >> latency services - RAM is king. And when you expand a HBase cluster >> you get more effective ram immediately - no rebalancing required >> (unlike DHT-based architectures). > > What I understood from the above is as follows. I'd appreciate if you could > point out if I am wrong. > > 1. I need to perform major-compaction to unassign regions from the existing > loaded region servers to a new region server.
This is not so - regions are automatically reassigned with no compaction necessary.
> I cannot reassign the regions just by doing minor compaction and letting the > non-loaded new server perform major compaction later. Having the loaded > existing server do heavy major compaction is a concern.
this is not what happens, regions are reassigned without requiring any compaction of any kind.
> 2. "no rebalancing required" means that the blocks of HDFS files for regions > need not be moved from one datanode to another.
So when you add nodes to a new cluster, unless you are running the HDFS balancer, data does not migrate. As HBase naturally compacts tables (once a day by default) it will end up rewriting data and causing its migration. You can help accelerate this process by manually kicking off a compaction for a large table if you have added a lot of new machines.
-ryan > > Thank you. > Maumau > >
-
Re: How does HBase perform load balancing?
MauMau 2010-05-09, 01:34
Thanks, Ryan,
- To utilize the CPU and memory of new region servers: I do not have to do anything. The master automatically reassign existing regions to the new servers. (I'll search the docs for how soon the master performs reassignment when adding new servers) - To utilize the storage space and I/O capacity of new region servers: Choose or combine the following: 1. Automatic major compaction (once a day) 2. Perform major compaction explicitly 3. Use HDFS balancer
Maumau
----- Original Message ----- From: "Ryan Rawson" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Sunday, May 09, 2010 10:08 AM Subject: Re: How does HBase perform load balancing?
> What I understood from the above is as follows. I'd appreciate if you > could > point out if I am wrong. > > 1. I need to perform major-compaction to unassign regions from the > existing > loaded region servers to a new region server.
This is not so - regions are automatically reassigned with no compaction necessary.
> I cannot reassign the regions just by doing minor compaction and letting > the > non-loaded new server perform major compaction later. Having the loaded > existing server do heavy major compaction is a concern.
this is not what happens, regions are reassigned without requiring any compaction of any kind.
> 2. "no rebalancing required" means that the blocks of HDFS files for > regions > need not be moved from one datanode to another.
So when you add nodes to a new cluster, unless you are running the HDFS balancer, data does not migrate. As HBase naturally compacts tables (once a day by default) it will end up rewriting data and causing its migration. You can help accelerate this process by manually kicking off a compaction for a large table if you have added a lot of new machines.
-
Re: How does HBase perform load balancing?
Ryan Rawson 2010-05-09, 01:36
I dont think the docs say anything about how the balancer works, but it is immediate. As soon as a Regionserver makes the 'report for duty' call back to the master the master will begin reassigning regions. It can take a few for regions to flush then close and move, but it is done fairly efficiently and quickly.
The other points are quite accurate - thanks for the summary. If you are interested in writing blog articles or documentation, we would greatly appreciate it, and can link/embed it on our site/docs. There are only so many hours in the day alas.
Thanks again, -ryan
On Sat, May 8, 2010 at 6:34 PM, MauMau <[EMAIL PROTECTED]> wrote: > Thanks, Ryan, > > - To utilize the CPU and memory of new region servers: > I do not have to do anything. > The master automatically reassign existing regions to the new servers. > (I'll search the docs for how soon the master performs reassignment when > adding new servers) > - To utilize the storage space and I/O capacity of new region servers: > Choose or combine the following: > 1. Automatic major compaction (once a day) > 2. Perform major compaction explicitly > 3. Use HDFS balancer > > Maumau > > ----- Original Message ----- From: "Ryan Rawson" <[EMAIL PROTECTED]> > To: <[EMAIL PROTECTED]> > Sent: Sunday, May 09, 2010 10:08 AM > Subject: Re: How does HBase perform load balancing? > >> What I understood from the above is as follows. I'd appreciate if you >> could >> point out if I am wrong. >> >> 1. I need to perform major-compaction to unassign regions from the >> existing >> loaded region servers to a new region server. > > This is not so - regions are automatically reassigned with no > compaction necessary. > >> I cannot reassign the regions just by doing minor compaction and letting >> the >> non-loaded new server perform major compaction later. Having the loaded >> existing server do heavy major compaction is a concern. > > this is not what happens, regions are reassigned without requiring any > compaction of any kind. > >> 2. "no rebalancing required" means that the blocks of HDFS files for >> regions >> need not be moved from one datanode to another. > > So when you add nodes to a new cluster, unless you are running the > HDFS balancer, data does not migrate. As HBase naturally compacts > tables (once a day by default) it will end up rewriting data and > causing its migration. You can help accelerate this process by > manually kicking off a compaction for a large table if you have added > a lot of new machines. > >
-
Re: How does HBase perform load balancing?
Todd Lipcon 2010-05-09, 03:24
On Sat, May 8, 2010 at 6:08 PM, Ryan Rawson <[EMAIL PROTECTED]> wrote:
> > 2. "no rebalancing required" means that the blocks of HDFS files for > regions > > need not be moved from one datanode to another. > > So when you add nodes to a new cluster, unless you are running the > HDFS balancer, data does not migrate. I disagree that the HDFS balancer really helps here. Yes, it will migrate data to the new datanode, but the data that it migrates is not at all tied to the access patterns of HBase (ie it won't choose to migrate blocks that are part of the regions that got migrated to the new regionserver)
-Todd -- Todd Lipcon Software Engineer, Cloudera
|