HBase, mail # user - Heterogeneous cluster


Re: Heterogeneous cluster
Robert Dyer 2012-12-08, 23:50
I of course cannot speak for Jean-Marc; however, my use case is not very
corporate.  It is a small cluster (9 nodes), and only 1 of those nodes is
different (drastically different).

And yes, I configured it so that node has a lot more map slots.  However,
the problem is that HBase balances without regard to that, and thus even
though more map tasks run on that node, they are not data-local!  If I have
a balancer that is able to keep more regions on that particular node, then
the data locality of my map tasks is improved.
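
A minimal sketch of the balancing idea described above, not the balancer discussed in this thread: given a per-server weight (a hypothetical stand-in for capacity, such as the number of map slots), it computes how many regions each server should hold so that the counts are proportional to the weights. The class name `WeightedRegionPlan`, the method `targets`, and the server names are assumptions made for illustration. It deliberately uses no HBase API, so wiring it into a real load balancer is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: capacity-weighted target region counts per server. */
final class WeightedRegionPlan {
    private WeightedRegionPlan() {}

    /**
     * Splits totalRegions across servers in proportion to their weights.
     * Leftover regions (from rounding down) go to the servers that are
     * furthest below their ideal fractional share.
     */
    static Map<String, Integer> targets(Map<String, Integer> weights, int totalRegions) {
        if (totalRegions < 0) {
            throw new IllegalArgumentException("totalRegions must be >= 0");
        }
        long weightSum = 0;
        for (int w : weights.values()) {
            if (w < 0) {
                throw new IllegalArgumentException("weights must be >= 0");
            }
            weightSum += w;
        }
        if (weightSum == 0) {
            throw new IllegalArgumentException("at least one weight must be > 0");
        }

        // First pass: give every server the floor of its proportional share.
        Map<String, Integer> plan = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int share = (int) ((long) totalRegions * e.getValue() / weightSum);
            plan.put(e.getKey(), share);
            assigned += share;
        }

        // Second pass: hand out the remainder one region at a time.
        for (int left = totalRegions - assigned; left > 0; left--) {
            String best = null;
            double bestDeficit = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Integer> e : weights.entrySet()) {
                double ideal = (double) totalRegions * e.getValue() / weightSum;
                double deficit = ideal - plan.get(e.getKey());
                if (deficit > bestDeficit) {
                    bestDeficit = deficit;
                    best = e.getKey();
                }
            }
            plan.put(best, plan.get(best) + 1);
        }
        return plan;
    }
}
```

For a 9-node cluster where eight nodes have weight 1 and one node has weight 4, 120 regions would come out as 10 per small node and 40 on the large one. The weights themselves are the hard part in practice; this sketch simply takes them as input.
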
On Sat, Dec 8, 2012 at 5:45 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> Take what I say with a grain of kosher salt. (It's what they put on your
> drink glasses because the grains are bigger. ;-)
>
> I think what you are doing is a cool hack; however, in the bigger picture,
> you shouldn't have to do this with your load balancer. Also, it doesn't
> matter if you think about it.
>
> With a heterogeneous cluster, you will not share the same configuration
> across all machines in the cluster. You will change the number of slots per
> node based on its capacity.
> That will limit the amount of work that can be done on each node of the
> same cluster.
>
> You could also consider playing with the rack-aware aspects of your
> cluster.
> You could put all of your 2-CPU machines in the same rack.
>
> In theory... machine, rack, second rack is how the data is distributed.
> In theory, if the 2-CPU machines are neighbors, then the 2nd and/or 3rd
> copy goes to another machine.
>
> Trying to write a custom balancer may be a good hack, but it is not good
> in terms of corporate life.
>
> Just saying!
>
> -Mike
>
> On Dec 8, 2012, at 1:34 PM, Jean-Marc Spaggiari <[EMAIL PROTECTED]>
> wrote:
>
> > Hi,
> >
> > It's not yet available anywhere. I will post it today or tomorrow;
> > I just need time to remove some hardcoding I did in it ;) It's a quick
> > and dirty PerformanceBalancer. It's not a CPULoadBalancer.
> >
> > Anyway, I will give more details over the weekend, but there is
> > absolutely nothing extraordinary about it.
> >
> > JM
> >
> > 2012/12/8, Robert Dyer <[EMAIL PROTECTED]>:
> >> I too am interested in this custom load balancer, as I was actually just
> >> starting to look into writing one that does the same thing for
> >> my heterogeneous cluster!
> >>
> >> Is this available somewhere?
> >>
> >> On Sat, Dec 8, 2012 at 9:17 AM, James Chang <[EMAIL PROTECTED]>
> >> wrote:
> >>
> >>>     By the way, I saw you mentioned that you
> >>> have built a "LoadBalancer", could you kindly
> >>> share some detailed info about it?
> >>>
> >>> On Saturday, December 8, 2012, Jean-Marc Spaggiari wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Here is the situation.
> >>>>
> >>>> I have a heterogeneous cluster with 2-core, 4-core and 8-core CPU
> >>>> servers. The performance of those different servers allows them to
> >>>> handle different sizes of load. So far, I built a LoadBalancer
> >>>> which balances the regions over those servers based on their
> >>>> performance. And it’s working quite well. The RowCounter went down
> >>>> from 11 minutes to 6 minutes. However, I can still see that tasks
> >>>> are run on some servers accessing data on other servers, which
> >>>> overwhelms the bandwidth and slows down the process, since some 2-core
> >>>> servers are assigned to count rows hosted on 8-core servers.
> >>>>
> >>>> I’m looking for a way to “force” the tasks to run on the servers where
> >>>> the regions are assigned.
> >>>>
> >>>> I first tried to reject the tasks in the Mapper setup method when the
> >>>> data was not local, to see if the tracker would assign them to another
> >>>> server. No. The task just fails and is mostly not reassigned. I tried
> >>>> IOExceptions, RuntimeExceptions, InterruptionExceptions with no
> >>>> success.
> >>>>
> >>>> So now I have 3 possible options.
> >>>>
> >>>> The first one is to move from MapReduce to a Coprocessor
> >>>> Endpoint. Running locally on the RegionServer, it’s accessing only the
> >>>> local data and I can manually reject everything that is not local. Therefor

Robert Dyer
[EMAIL PROTECTED]