Thanks for your response. I know that HBase chooses to support C+P. Is
there any report on the availability of HBase?
I am still confused about who is responsible for assigning a new
RegionServer: ZooKeeper or HMaster? Which one holds the information
(metadata) about data replication? Or does HMaster or ZooKeeper have to
contact the NameNode of HDFS to obtain data replication information?
On Sat, Dec 10, 2011 at 12:00 AM, lars hofhansl <[EMAIL PROTECTED]> wrote:
> Hi Yong,
> HBase trades availability for consistency (please see the recent thread on this list: "HBase and Consistency in CAP").
> RegionServers read data from HDFS. If you set aside the performance implications of data locality for a moment, then
> it does not matter from where the RegionServer reads the data.
> You can look at it this way: Datanodes (HDFS) handle the distribution of the data, RegionServers handle the distribution of CPU load and
> control access to the data.
> Hotspotting is a problem. You deal with it by avoiding it :), i.e. distribute the data in a way that does not lead to hotspotting (by choosing proper row keys).
> Hot regions can be split, etc.
> According to the documentation found at https://issues.apache.org/jira/browse/HDFS-265, hflush returns to the client only when all nodes
> in the pipeline have sync'ed the data.
> -- Lars
> ----- Original Message -----
> From: yonghu <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Friday, December 9, 2011 11:40 AM
> Subject: availability and data replica issues of HBase
> I read some discussions on the mailing list. They mention that read and
> write operations for the same data object will be routed to the same
> RegionServer. This strategy can guarantee data consistency. But what
> about availability? If this RegionServer is down or temporarily
> unavailable, will the master assign a new RegionServer to process
> data requests, or just wait until that RegionServer comes back? If the master
> assigns a new RegionServer, how can the new RegionServer obtain the data?
> The other issue is about load balancing. If a huge number of read and
> write operations apply only to a small set of data, one RegionServer
> may become a hot spot. How does HBase deal with this problem?
> The last question is about data replicas. The HBase data is still
> stored in HDFS. HDFS will use eager synchronization (pipelining) to
> synchronize all replicas. When HBase writes data into HDFS, when should
> HDFS return the write-completion acknowledgment to HBase: after
> one replica is updated, or after all replicas are updated?
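
For the hotspotting point in Lars's reply (avoid hot regions by choosing proper row keys), here is a minimal sketch of one common approach, salting the row key with a hash-derived prefix. It is only an illustration and not part of HBase itself: the function name salted_row_key, the NUM_BUCKETS value, and the "NN-" prefix format are all assumptions made for this example.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical number of salt buckets, not an HBase setting


def salted_row_key(key: str, buckets: int = NUM_BUCKETS) -> bytes:
    """Prefix the key with a stable bucket id derived from its hash.

    Keys that would otherwise sort next to each other (for example
    monotonically increasing timestamps) end up under different
    prefixes, so they are spread over different key ranges.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    bucket = digest[0] % buckets  # same key always maps to the same bucket
    return f"{bucket:02d}-".encode("ascii") + key.encode("utf-8")


if __name__ == "__main__":
    # Sequential keys receive different prefixes instead of clustering.
    for ts in range(5):
        print(salted_row_key(f"event-{ts:010d}"))
```

The trade-off is that rows no longer sort in their original key order, so a range scan over the original keys has to issue one scan per bucket and merge the results.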