HBase >> mail # user >> Heterogeneous cluster


RE: Heterogeneous cluster
> But if the job is running there, it can also be
> considered as running locally, right? Or will it always be retrieved
> from the datanode linked to the RS hosting the region we are dealing
> with? Not sure I'm clear :(

Hi Jean,
Sorry, I have not read the earlier history of this thread. Going by this question alone, I assume the MR job is scanning HTable data. Even if the job runs on a node that holds a replica, I don't think the read will be local. The MR job has to fetch the data through HBase, which means it needs to contact the RS hosting the region. HBase will then in turn contact one of the DNs where the data is available, so there are multiple steps. No RS is linked to one particular DN; which DN the data is fetched from depends on the decision taken by the DFS client. It may not contact any DN at all and do a local read instead, if the short-circuit read option is enabled and the data is on the same server where the region is hosted. I hope that makes it clear. :)
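
The steps above can be sketched as a small toy model in Python. This is an illustration only, not HBase or HDFS code: the function `plan_read`, its parameters, and the host names are all hypothetical, and the replica choice is a simplified stand-in for whatever the DFS client actually decides.

```python
def plan_read(task_host, rs_host, replica_hosts, short_circuit=False):
    """Toy model of the read path for an MR task scanning a table.

    Returns the ordered list of (component, host) hops. All names here are
    hypothetical; this does not call or mimic any real HBase/HDFS API.
    """
    if not replica_hosts:
        raise ValueError("a block needs at least one replica host")

    # The task always goes through the RegionServer hosting the region,
    # even when the task's own node holds a replica of the block.
    hops = [("MR task", task_host), ("RegionServer", rs_host)]

    if short_circuit and rs_host in replica_hosts:
        # Short-circuit read: the data sits on the RegionServer's own host,
        # so it is read from local disk without contacting a DataNode.
        hops.append(("local disk", rs_host))
    else:
        # ASSUMPTION: stand-in for the DFS client's decision. Here it picks
        # a replica on the RegionServer's host if there is one, otherwise
        # the first replica listed.
        chosen = rs_host if rs_host in replica_hosts else replica_hosts[0]
        hops.append(("DataNode", chosen))
    return hops


if __name__ == "__main__":
    # Task runs on node3, which holds a replica, but the region is on node1.
    print(plan_read("node3", "node1", ["node2", "node3", "node4"]))
    # Region and a replica are both on node1, short-circuit reads enabled.
    print(plan_read("node1", "node1", ["node1", "node2"], short_circuit=True))
```

In the first call the task's node holds a replica, yet the modeled path still goes through the RS on node1 and then to a DN. Only the second case ends in a local disk read.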

-Anoop-

________________________________________
From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
Sent: Monday, December 10, 2012 7:33 PM
To: [EMAIL PROTECTED]
Subject: Re: Heterogeneous cluster

@Asaf & Robert: I have posted the code here. But be careful with it.
Read Mike's comment above.
http://www.spaggiari.org/index.php/hbase/changing-the-hbase-default-loadbalancer
I'm a newbie on HBase, so you'd be better off relying on feedback from
someone more experienced.

@Mike:

Hi Mike,

I totally agree with your opinion. My balancer is totally a hack on a
'Frankencluster' (BTW, I LOVE this description! Perfect fit!) and a
way for me to take a deeper look at HBase's code.

One question about data locality. When you run an HBase MR job, even with
a replication factor of 3, data is considered local only if the task is
running on the RS where the region is stored. But does HBase have a way to
see whether it can run on any of the replicas? The replica might be
on a different rack. But if the job is running there, it can also be
considered as running locally, right? Or will it always be retrieved
from the datanode linked to the RS hosting the region we are dealing
with? Not sure I'm clear :(

JM

2012/12/9, Michael Segel <[EMAIL PROTECTED]>:
> Ok...
>
> From a production/commercial grade answer...
>
> With respect to HBase, you will have 1 live copy and 2 replicas.
> (Assuming you didn't change this.) So when you run against HBase, data
> locality becomes less of an issue.
> And again, you have to temper that with the fact that it depends on the
> number of regions within the table...
>
> A lot of people, including committers, tend to get hung up on some of the
> details and lose focus on the larger picture.
>
> If you were running a production cluster and your one node was radically
> different... then you would be better off taking it out of the cluster and
> making it an edge node. (Edge nodes are very important...)
>
> If we're talking about a very large cluster which has evolved... then you
> would want to work out your rack aware placements.  Note that rack aware is
> a logical and not a physical location. So you can modify it to let the
> distro's placement take the hint and move the data.  This is more of a cheat
> and even here... I think that at scale, the potential improvement gains are
> going to be minimal.
>
> This works for everything but HBase.
>
> On that note, it doesn't matter. Again, assume that you have your data
> equally distributed around the cluster and that your access pattern is to
> all nodes in the cluster.  The parallelization in the cluster will average
> out the slow ones.
>
> In terms of your small research clusters...
>
> You're not looking at performance when you build a 'Frankencluster'
>
> Specifically in your case... move all the data to that node and you end up
> with both networking and disk I/O bottlenecks.
>
> You're worried about the noise.
>
> Having said that...
>
> If you want to improve the balancer code, sure, however, you're going to