HBase, mail # dev - Cannot locate root region

Karthik Ranganathan 2010-01-28, 23:57
Karthik Ranganathan 2010-01-29, 05:30
Joydeep Sarma 2010-01-29, 09:20
RE: Cannot locate root region
Kannan Muthukkaruppan 2010-01-29, 16:33
@Joy: The info stored in .META. for various regions, as well as in the ephemeral nodes for region servers in ZooKeeper, is already IP address based. So it doesn't look like multi-homing and/or the other flexibilities you mention were a design goal, as far as I can tell.

Regarding: <<< doesn't the reverse ip lookup just once at RS startup time?>>>, what seems to be happening is this:

A regionServer periodically sends a regionServerReport (RPC call) to the master. An HServerInfo is passed as an argument, and it identifies the sending region server in IP address format.

The master, in the ServerManager class, maintains a serversToServerInfo map which is hostname based. Every time the master receives a regionServerReport, it converts the IP address based name to a hostname via the info.getServerName() call. Normally this call returns the hostname, but we suspect that during the DNS flakiness it returned an IP address based string. This caused ServerManager.java to think that it was hearing from a new server, which led to:

    HServerInfo storedInfo = serversToServerInfo.get(info.getServerName());
    if (storedInfo == null) {
      if (LOG.isDebugEnabled()) {
        LOG.debug("Received report from unknown server -- telling it " +   <<===========
          "to " + CALL_SERVER_STARTUP + ": " + info.getServerName());  <<===========
      }

and bad things down the road.

The above error message in our logs (example below) indeed identified the host in IP address syntax, even though normally the getServerName call would return the info in hostname format.

2010-01-28 11:21:34,539 DEBUG org.apache.hadoop.hbase.master.ServerManager: Received report from unknown server -- telling it to MSG_CALL_SERVER_STARTUP:,60020,1263605543210
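
To make the suspected failure mode concrete, here is a minimal sketch. It is not the real ServerManager code: the class ServerRegistrySketch, its methods, and the example host names in the usage note are all hypothetical, and the map value is simplified to an integer load. It only illustrates how a map keyed by a "host,port,startcode" string misses when the host part arrives as an IP address instead of a hostname.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the master's bookkeeping; not the real ServerManager.
class ServerRegistrySketch {
    // Keyed by server name in "host,port,startcode" form, as in the log line above.
    private final Map<String, Integer> serversToLoad = new HashMap<>();

    // Builds the map key from whatever host string the caller resolved.
    static String serverName(String hostOrIp, int port, long startCode) {
        return hostOrIp + "," + port + "," + startCode;
    }

    void register(String serverName) {
        serversToLoad.put(serverName, 0);
    }

    // Returns true if the report matched a known server; false means the
    // master would treat the sender as unknown and ask it to start up again.
    boolean handleReport(String serverName, int load) {
        if (!serversToLoad.containsKey(serverName)) {
            return false;
        }
        serversToLoad.put(serverName, load);
        return true;
    }
}
```

If a server registers as "rs1.example.com,60020,1263605543210" and a later report is keyed as "10.0.0.5,60020,1263605543210", handleReport returns false even though the port and start code are identical, which matches the "unknown server" symptom.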

This affected three of our test clusters at the same time!

Perhaps all we need to do is to change the ServerManager's internal maps to all be IP based? That way we avoid/bypass the master having to look up the hostname on every heartbeat.
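
A rough sketch of what an IP-based key could look like, assuming the master has the raw address bytes of the reporting server available. IpKeySketch and ipKey are hypothetical names, not existing HBase code. The point is that InetAddress.getByAddress and getHostAddress only format the address bytes and never consult DNS, so the key cannot change when name resolution is flaky.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper; shows a map key that does not depend on DNS.
class IpKeySketch {
    // Builds "ip,port,startcode" from raw address bytes (4 for IPv4, 16 for IPv6).
    static String ipKey(byte[] rawIp, int port, long startCode) {
        try {
            // getByAddress only validates the length; it does no name lookup.
            InetAddress addr = InetAddress.getByAddress(rawIp);
            // getHostAddress returns the textual IP without a reverse lookup.
            return addr.getHostAddress() + "," + port + "," + startCode;
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("raw address must be 4 or 16 bytes", e);
        }
    }
}
```

The trade-off Joydeep raises still applies: a key like this is stable under DNS trouble but ties the server's identity to one address, which matters for multi-homed hosts or a cluster whose IPs change.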

From: Joydeep Sarma [[EMAIL PROTECTED]]
Sent: Friday, January 29, 2010 1:20 AM
Subject: Re: Cannot locate root region

hadoop also uses the hostnames. if a host is multi-homed - its
hostname is a better identifier (which still allows it to use
different nics/ips for actual traffic). it can help in the case the
cluster is migrated for example (all the ips change). one could have
the same hostname resolve to different ips depending on who's doing
the lookup (this happens in AWS where the same elastic hostname
resolves to private or public ip depending on where the peer is. so
clients can talk from outside AWS via public ips and master etc. can
talk over private ips).

so lots of reasons i guess. doesn't the reverse ip lookup just once at
RS startup time? (wondering how this reconciles with the  DNS being
flaky after the cluster was up and running).

On Thu, Jan 28, 2010 at 9:30 PM, Karthik Ranganathan wrote:
> We did some more digging into this and here is the theory.
> 1. The regionservers use their local IP to look up their hostnames and pass that to the HMaster. The HMaster finds the server info by using this hostname as the key in the HashMap.
> HRegionServer.java
> reinitialize() -
> this.serverInfo = new HServerInfo(new HServerAddress(
>      new InetSocketAddress(address.getBindAddress(),
>      this.server.getListenerAddress().getPort())), System.currentTimeMillis(),
>      this.conf.getInt("hbase.regionserver.info.port", 60030), machineName);
> In run() -
> HMsg msgs[] = hbaseMaster.regionServerReport(
>              serverInfo, outboundArray, getMostLoadedRegions());
> 2. I have observed in the past that there could be some DNS flakiness which causes the IP address of the machines to be returned as their hostnames. Guessing this is what happened.
> 3. The HMaster looks in the map for the above IP address (masquerading as the server name). It does not find an entry in its map, so it assumes that this is a new region server and issues a CALL_SERVER_STARTUP.
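
One way step 2 of the theory could arise, sketched with the plain JDK rather than HBase's own lookup path (whether HBase resolves names through this exact call is an assumption): InetAddress.getCanonicalHostName() is documented to return the textual IP address when the reverse lookup cannot be completed, so a failed lookup is indistinguishable from a "hostname" unless the caller checks for it. ReverseLookupSketch and its methods are hypothetical names.

```java
import java.net.InetAddress;

// Hypothetical helper; detects a reverse lookup that fell back to the IP.
class ReverseLookupSketch {
    // True when the resolved "name" is just the textual IP, meaning the
    // reverse lookup did not produce a real hostname.
    static boolean fellBackToIp(String resolvedName, String ipLiteral) {
        return resolvedName.equals(ipLiteral);
    }

    // Resolves the address and reports whether a real name came back.
    static String describe(InetAddress addr) {
        String ip = addr.getHostAddress();          // no DNS involved
        String name = addr.getCanonicalHostName();  // may perform a reverse lookup
        return fellBackToIp(name, ip) ? "no name for " + ip : name + " -> " + ip;
    }
}
```

A caller that sees fellBackToIp return true could retry, or keep using the previously known hostname, instead of reporting the IP string as its server name.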
Jean-Daniel Cryans 2010-01-29, 17:39
Joydeep Sarma 2010-01-29, 17:45
Karthik Ranganathan 2010-01-29, 18:44
Joydeep Sarma 2010-01-29, 19:01
Karthik Ranganathan 2010-01-29, 19:19
Jean-Daniel Cryans 2010-01-29, 19:23
Joydeep Sarma 2010-01-29, 19:29
Stack 2010-01-29, 20:08