HBase >> mail # user >> Heterogeneous cluster


RE: Heterogeneous cluster

Hi Jean,
     I hope Harsh's reply made things clear. Thanks, Harsh.
Please always keep in mind the two layers: HBase, and under it the HDFS layer where the data actually lives. When you read HBase tables via MR, the reads go through the regions, not directly to the stored HFiles. So yes, if the task for region1 is running on N2 and region1 is hosted on N1, there will be an RPC to N1, and the DFS client on N1 may in turn read the data from N1. So even if the data is replicated on N2, data locality does not help you here.
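
To make the read path concrete, here is a minimal toy model in Python. It is not HBase code: `Cluster`, `read_region`, and the node labels are made up for illustration. It assumes a region is served by exactly one node and that the serving node's DFS client reads a local replica when one exists.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    # region name -> node whose RegionServer currently serves it
    region_assignment: dict
    # region name -> nodes whose DataNodes hold a replica of its blocks
    block_replicas: dict


def read_region(cluster, region, task_node):
    """Describe the hops for a task on task_node reading the given region."""
    server = cluster.region_assignment[region]
    hops = []
    # Hop 1: the task always talks to the assigned RegionServer.
    if task_node == server:
        hops.append(f"{task_node} -> RegionServer@{server} (same node)")
    else:
        hops.append(f"{task_node} -> RegionServer@{server} (RPC over network)")
    # Hop 2: the RegionServer's DFS client reads the blocks. Locality here
    # depends on the RegionServer's node, not on the task's node.
    if server in cluster.block_replicas[region]:
        hops.append(f"RegionServer@{server} -> DataNode@{server} (local HDFS read)")
    else:
        remote = sorted(cluster.block_replicas[region])[0]
        hops.append(f"RegionServer@{server} -> DataNode@{remote} (remote HDFS read)")
    return hops


cluster = Cluster(
    region_assignment={"region1": "N1"},
    block_replicas={"region1": {"N1", "N2", "N3"}},
)
for hop in read_region(cluster, "region1", task_node="N2"):
    print(hop)
```

In this model the replica on N2 is never consulted, even though the task runs on N2, which is the point made above.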

HDFS-2246 introduced short-circuit reads. The HBase book section linked in Harsh's reply below explains in detail how and when they are useful.
If you are on a 0.94.x version, it may also be worth enabling the HBase-handled checksum option for better performance. [This only works when the read is a short-circuited local read.]
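
As a rough sketch of what these two options look like as configuration, the Python snippet below renders the relevant properties as `<property>` elements. The property names are the ones I recall from the 0.94-era HBase reference guide and should be treated as assumptions to verify against your HBase and Hadoop versions; the user name `hbase` is a placeholder for whichever user starts HBase, and `render_properties` is a made-up helper.

```python
import xml.etree.ElementTree as ET

# Assumed property names for the 0.94.x / HDFS-2246 era. Check them
# against the documentation for your HBase and Hadoop versions.
HDFS_SITE = {
    # Placeholder value: the user allowed to use the local-read shortcut.
    "dfs.block.local-path-access.user": "hbase",
}
HBASE_SITE = {
    "dfs.client.read.shortcircuit": "true",
    # HBase-level checksums, so HDFS does not verify them a second time.
    "hbase.regionserver.checksum.verify": "true",
}


def render_properties(props):
    """Render name/value pairs as <property> elements for a *-site.xml file."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


print("hdfs-site.xml:", render_properties(HDFS_SITE))
print("hbase-site.xml:", render_properties(HBASE_SITE))
```

The printed fragments would be merged into the existing `hdfs-site.xml` and `hbase-site.xml` files, and the affected daemons restarted.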

-Anoop-
________________________________________
From: Harsh J [[EMAIL PROTECTED]]
Sent: Wednesday, December 12, 2012 1:50 AM
To: [EMAIL PROTECTED]
Subject: Re: Heterogeneous cluster

Hi,

On Wed, Dec 12, 2012 at 12:18 AM, Jean-Marc Spaggiari
<[EMAIL PROTECTED]> wrote:
> Hi Anoop,
>
> Thanks for the clarification.
>
> So let's take one example.
>
> Let's say I have 4 nodes and a replication factor set to 3.
>
> I have a region hosted on N1, replicated on N2 and N3. Nothing about
> this region on N4.

The important bit is, pending further enhancements along this line,
"regions" are not replicated. Region's data is replicated on HDFS, but
a Region itself is not replicated. It is served from a single point
(where it is currently assigned). Region data read requests are done
via the RegionServer layer, not directly from DataNodes (from a client
POV).

> It's time to run an MR job, and someone needs to work on the given region.
> N1 is too busy, so the region will be given to another node. Does it mean
> it will be given randomly between N2, N3 and N4?

HBase jobs submit with the split locations for each region being its
current assignee (at time of submission). This gives the "locality".

> If it's given to N4, it's missing an opportunity to get the data almost locally.

If your task gets assigned to any other node or if the region moves
after the job's begun, the data locality of the reads the regionserver
does may easily be affected, yes.
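
The point about split locations can be illustrated with a small Python sketch. This is a toy model, not the HBase input-format code: `compute_splits` and `is_local` are hypothetical helpers, and "local" here only means the task runs on the node currently serving the region.

```python
def compute_splits(region_assignment):
    """One split per region; its preferred location is the current assignee."""
    return [
        {"region": region, "location": node}
        for region, node in sorted(region_assignment.items())
    ]


def is_local(split, task_node, region_assignment):
    """A task is local only if it runs where the region is served right now."""
    return task_node == region_assignment[split["region"]]


assignment = {"region1": "N1", "region2": "N2"}
splits = compute_splits(assignment)  # snapshot taken at job submission

# The scheduler honours the hint: the task for region1 runs on N1.
print(is_local(splits[0], splits[0]["location"], assignment))  # True

# The task lands elsewhere, for example because N1 is busy.
print(is_local(splits[0], "N4", assignment))  # False

# The region moves after submission, so the split's hint is now stale.
assignment["region1"] = "N3"
print(is_local(splits[0], splits[0]["location"], assignment))  # False
```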

> Also, if the job is given to N2 or N3, are they going to remotely query
> the data over the network from N1? Or are they able to read it from
> the replica? Based on what you are saying, it seems that they will
> retrieve it from N1. Is there not another opportunity to improve the
> process by reading from the replicated data and not from the master
> one?

As explained above, all reads go through the assigned regionserver. So
the concept of HDFS block replicas can't be applied here yet (I do
know enhancements around this are planned).

> When you are talking about "the short circuit read option", is this
> something we need to enable as a property? Or is it more like a piece
> of code?

It's configs, and the speed-drug details are at
http://hbase.apache.org/book.html#perf.hdfs section "11.10.2.
Leveraging local data".

--
Harsh J