HBase, mail # user - Heterogeneous cluster


Re: Heterogeneous cluster
Jean-Marc Spaggiari 2012-12-11, 18:48
Hi Anoop,

Thanks for the clarification.

So let's take one example.

Let's say I have 4 nodes and a replication factor set to 3.

I have a region hosted on N1, replicated on N2 and N3. Nothing about
this region on N4.

It's time to run an MR job, and some node needs to work on the given
region. N1 is too busy, so the region will be given to another node.
Does that mean it will be given randomly to N2, N3 or N4?

If it's given to N4, it's missing an opportunity to get the data almost locally.

Also, if the job is given to N2 or N3, are they going to remotely query
the data over the network from N1? Or are they able to read it from
their own replica? Based on what you are saying, it seems that they will
retrieve it from N1. Is there not another opportunity to improve the
process by reading from the replicated data and not from the master
copy?
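
To make the example concrete, here is a toy Python model of the read
path as I understand it from your description. It is only a sketch:
the node names, REGION_SERVER, BLOCK_REPLICAS, pick_datanode and
read_path are all made up for illustration, and none of this is HBase
or HDFS API. It assumes the DFS client prefers a replica on its own
node when one exists.

```python
# Toy model only: every name here is hypothetical, not HBase/HDFS API.
REGION_SERVER = "N1"                  # node whose RegionServer hosts the region
BLOCK_REPLICAS = {"N1", "N2", "N3"}   # nodes holding HDFS replicas of the region's files


def pick_datanode(reader_node, replicas):
    """Mimic a DFS client: prefer a replica on the reader's own node."""
    if reader_node in replicas:
        return reader_node
    return sorted(replicas)[0]  # arbitrary remote choice for the toy model


def read_path(task_node):
    """Return the network hops a scan takes when the MR task runs on task_node."""
    hops = []
    if task_node != REGION_SERVER:
        # The task always goes through the RegionServer hosting the region.
        hops.append((task_node, REGION_SERVER))
    # The DFS client runs inside the RegionServer, so locality is judged from there.
    datanode = pick_datanode(REGION_SERVER, BLOCK_REPLICAS)
    if datanode != REGION_SERVER:
        hops.append((REGION_SERVER, datanode))
    return hops


for node in ["N1", "N2", "N3", "N4"]:
    print(node, read_path(node))
```

In this model a task on N2 or N3 takes the same single hop to N1 as a
task on N4, which is why I wonder whether the local replicas could be
used instead.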

When you are talking about "the short circuit read option", is this
something we need to enable as a property? Or is it more like a piece
of code?

JM

2012/12/10, Anoop Sam John <[EMAIL PROTECTED]>:
>>But if the job is running there, it can also be
> considered as running locally, right? Or will it always be retrieved
> from the datanode linked to the RS hosting the region we are dealing
> with? Not sure I'm clear :(
>
> Hi Jean,
>                  Sorry, I have not seen the history of this mailing thread.
> Going by this question alone, I guess the MR job is scanning HTable
> data, so even if the job is running on a node holding a replica I don't
> think it will be local. The MR job needs to fetch the data via HBase only,
> which means it needs to contact the RS hosting the region. HBase will then
> in turn contact any of the DNs where the data is available. So there are
> multiple steps. There is nothing like one RS being linked in some way to
> one DN. Which DN the data is fetched from depends on the decision taken by
> the DFS client. It may not contact any DN at all but do a local read
> instead, if the short circuit read option is enabled and the data is on
> the same server where the region is hosted. I hope that makes it clear. :)
>
> -Anoop-
>
> ________________________________________
> From: Jean-Marc Spaggiari [[EMAIL PROTECTED]]
> Sent: Monday, December 10, 2012 7:33 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Heterogeneous cluster
>
> @Asaf & Robert: I have posted the code here. But be careful with it.
> Read Mike's comment above.
> http://www.spaggiari.org/index.php/hbase/changing-the-hbase-default-loadbalancer
> I'm a newbie on HBase, so you'd better rely on feedback from someone
> more experienced.
>
> @Mike:
>
> Hi Mike,
>
> I totally agree with your opinion. My balancer is totally a hack on a
> 'Frankencluster' (BTW, I LOVE this description! Perfect fit!) and a
> way for me to take a deeper look at HBase's code.
>
> One question about data locality. When you run an HBase MR job, even with
> a replication factor of 3, data is considered local only if the job is
> running on the RS where the region is stored. But does HBase have a way to
> see if it can be run on any of the replicas? The replica might be
> on a different rack. But if the job is running there, it can also be
> considered as running locally, right? Or will it always be retrieved
> from the datanode linked to the RS hosting the region we are dealing
> with? Not sure I'm clear :(
>
> JM
>
> 2012/12/9, Michael Segel <[EMAIL PROTECTED]>:
>> Ok...
>>
>> From a production/commercial grade answer...
>>
>> With respect to HBase, you will have 1 live copy and 2 replications.
>> (Assuming you didn't change this.) So when you run against HBase, data
>> locality becomes less of an issue.
>> And again, you have to temper that with that it depends on the number of
>> regions within the table...
>>
>> A lot of people, including committers tend to get hung up on some of the
>> details and they tend to lose focus on the larger picture.
>>
>> If you were running a production cluster and your one node was radically
>> different... then you would be better off taking it out of the cluster