HBase, mail # user - Heterogeneous cluster

Jean-Marc Spaggiari 2012-12-08, 03:32
Asaf Mesika 2012-12-09, 10:08
James Chang 2012-12-08, 15:17
Robert Dyer 2012-12-08, 18:38
Jean-Marc Spaggiari 2012-12-08, 19:34
Michael Segel 2012-12-08, 23:45
Robert Dyer 2012-12-08, 23:50
Re: Heterogeneous cluster
Michael Segel 2012-12-09, 08:27

From a production/commercial grade answer...

With respect to HBase, you will have 1 live copy and 2 replicas. (Assuming you didn't change the default.) So when you run against HBase, data locality becomes less of an issue.
And again, you have to temper that: it depends on the number of regions within the table...

A lot of people, including committers, tend to get hung up on some of the details and lose focus on the larger picture.

If you were running a production cluster and your one node was radically different... then you would be better off taking it out of the cluster and making it an edge node. (Edge nodes are very important...)

If we're talking about a very large cluster which has evolved... then you would want to work out your rack-aware placements.  Note that a rack, as far as rack awareness goes, is a logical and not a physical location. So you can modify it to let the distro's placement take the hint and move the data.  This is more of a cheat and even here... I think that at scale, the potential gains are going to be minimal.
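
To illustrate the logical rack mapping mentioned above: Hadoop can call a user-supplied topology script that receives host names or IPs as arguments and prints one rack path per argument. The script is wired in through the topology.script.file.name property in Hadoop 1.x (net.topology.script.file.name in later releases). What follows is only a minimal sketch; the host names and rack labels are made up for illustration and would need to match your own cluster.

```python
#!/usr/bin/env python3
"""Hypothetical Hadoop topology script: maps hosts to *logical* racks."""
import sys

# Assumed host names and rack labels, for illustration only.
RACK_BY_HOST = {
    "node1": "/rack-small",
    "node2": "/rack-small",
    "bignode": "/rack-big",
}
DEFAULT_RACK = "/default-rack"


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(h, DEFAULT_RACK) for h in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts/IPs as arguments and reads the
    # whitespace-separated rack paths from standard output.
    print(" ".join(resolve(sys.argv[1:])))
```

Because the mapping is purely logical, grouping the odd machine into its own "rack" changes where block replicas are placed without moving any hardware.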

This works for everything but HBase.

On that note, it doesn't matter. Again, assume that you have your data equally distributed around the cluster and that your access pattern is to all nodes in the cluster.  The parallelization in the cluster will average out the slow ones.

In terms of your small research clusters...

You're not looking at performance when you build a 'Frankencluster'

Specifically to your case... move all the data to that node and you end up with both networking and disk I/O bottlenecks.

You're worried about the noise.

Having said that...

If you want to improve the balancer code, sure, however, you're going to need to do some work where you capture your cluster's statistics so that the balancer has more intelligence.
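
As a rough sketch of what "more intelligence" could mean here: once you have a per-server capacity figure (captured from cluster statistics or set by hand), a balancer could aim for region counts proportional to that capacity instead of an even split. The code below is not HBase's balancer interface, just the arithmetic; the function name, the server names, and the use of map slots as the capacity measure are all assumptions for illustration.

```python
def target_region_counts(capacity_by_server, total_regions):
    """Split total_regions across servers in proportion to their capacity."""
    total_capacity = sum(capacity_by_server.values())
    if total_capacity <= 0:
        raise ValueError("total capacity must be positive")

    # Ideal (fractional) share for every server.
    ideal = {
        server: total_regions * capacity / total_capacity
        for server, capacity in capacity_by_server.items()
    }
    # Start from the rounded-down share...
    targets = {server: int(share) for server, share in ideal.items()}
    leftover = total_regions - sum(targets.values())

    # ...then give the leftover regions to the largest fractional remainders.
    by_remainder = sorted(
        ideal, key=lambda server: ideal[server] - targets[server], reverse=True
    )
    for server in by_remainder[:leftover]:
        targets[server] += 1
    return targets


# Hypothetical 9-node cluster: 8 small nodes (2 map slots) and 1 big node (8 slots).
capacities = {"small%d" % i: 2 for i in range(1, 9)}
capacities["big"] = 8
targets = target_region_counts(capacities, 120)
```

Note that concentrating regions on one server in this way is exactly what can produce the network and disk I/O bottlenecks described above, so the capacity figure would have to account for more than map slots.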

You may start off wanting to allow HBase to take hints about the cluster, but in truth, I don't think it's a good idea. Note, I realize that you and Jean-Marc are not suggesting that it is your intent to add something like this, but someone will create a JIRA and then someone else may act upon it....

IMHO, adding intelligence to the HBase Scheduler is a lot of work, and I don't think it will really make a difference in terms of overall performance.
Just saying...


On Dec 8, 2012, at 5:50 PM, Robert Dyer <[EMAIL PROTECTED]> wrote:

> I of course cannot speak for Jean-Marc, however my use case is not very
> corporate.  It is a small cluster (9 nodes) and only 1 of those nodes is
> different (drastically different).
> And yes, I configured it so that node has a lot more map slots.  However,
> the problem is HBase balances without regard to that and thus even though
> more map tasks run on those nodes they are not data-local!  If I have a
> balancer that is able to keep more regions on that particular node, then
> the data locality of my map tasks is improved.
> On Sat, Dec 8, 2012 at 5:45 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
>> Take what I say with a grain of kosher salt. (It's what they put on your
>> drink glasses because the grains are bigger. ;-)
>> I think what you are doing is a cool hack, however in the bigger picture,
>> you shouldn't have to do this with your load balancer. Also it doesn't
>> matter if you think about it.
>> With a heterogeneous cluster, you will not share the same configuration
>> across all machines in the cluster. You will change the number of slots per
>> node based on its capacity.
>> That will limit what amount of work could be done on the same cluster.
>> You could also consider playing with the rack aware aspects of your
>> cluster.
>> You could put all of your 2CPU machines in the same rack.
>> In theory... machine, rack, second rack is how the data is distributed.
>> In theory if the 2CPU machines are neighbors, then the 2nd and/or 3rd copy
>> goes to another machine.
>> Trying to write a custom balancer, may be a good hack, but not good in
Jean-Marc Spaggiari 2012-12-10, 14:03
Anoop Sam John 2012-12-11, 04:04
Jean-Marc Spaggiari 2012-12-11, 18:48
Harsh J 2012-12-11, 20:20
Anoop Sam John 2012-12-12, 03:54
Jean-Marc Spaggiari 2012-12-09, 02:36