

RE: Regionservers not connecting to master
Just check your /etc/hosts files.  I have not worked with VMs, so I
cannot pinpoint the problem more precisely.

Regards
Ram
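
A check of the /etc/hosts files along these lines could be scripted. The
following is only a minimal Python sketch under stated assumptions: the host
names other than "namenode" are hypothetical, 2181 is ZooKeeper's default
client port (the thread does not say which port is in use), and the "ruok"
probe relies on ZooKeeper's four-letter commands being enabled on the peers.
It flags cluster host names that are missing from a hosts file, bound to a
loopback address, or bound to more than one address, and it can ask a ZK peer
whether it is serving.

```python
import socket


def parse_hosts(text):
    """Map each host name in /etc/hosts-style text to the IPs it is bound to."""
    mapping = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            mapping.setdefault(name, []).append(ip)
    return mapping


def suspicious_entries(mapping, cluster_hosts):
    """Return {host: reason} for hosts that look misconfigured."""
    problems = {}
    for host in cluster_hosts:
        ips = mapping.get(host, [])
        if not ips:
            problems[host] = "missing"
        elif any(ip.startswith("127.") or ip == "::1" for ip in ips):
            problems[host] = "loopback"
        elif len(set(ips)) > 1:
            problems[host] = "multiple addresses"
    return problems


def zk_is_ok(host, port=2181, timeout=3.0):
    """Send ZooKeeper's 'ruok' command; a healthy server replies 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            return sock.recv(16) == b"imok"
    except OSError:  # covers refused connections, timeouts, DNS failures
        return False
```

On each node, one might feed the contents of /etc/hosts to parse_hosts, pass
the result to suspicious_entries together with the list of master, ZK, and
regionserver host names, and then call zk_is_ok for each ZK peer. A peer that
answers "imok" from the master's VM but not from a regionserver would point
toward a name resolution or network problem rather than an HBase one.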

> -----Original Message-----
> From: Dan Brodsky [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 17, 2012 11:05 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Regionservers not connecting to master
>
> Well, slight change: only 1 of the ZK peers happens to work. When a RS
> connects to the other 2, it doesn't go further than that. The 1 ZK
> node that happens to work is the one that runs on the same VM as the
> master.
>
> Sounds like it could be network connectivity issues, so I'm going to
> investigate that a bit further, but other suggestions are welcome.
>
>
> On Wed, Oct 17, 2012 at 1:29 PM, Dan Brodsky <[EMAIL PROTECTED]>
> wrote:
> > Ram,
> >
> > Thanks for your suggestions.
> >
> > The datanodes are all built using the same image, so I know they're
> > all pointed to the same ZK nodes.
> >
> > I monitored all three ZK logs, the master log, and the regionserver
> > log for each RS I was trying to bring back online. I'm glad I have a
> > big screen. :-) Here is what I found:
> >
> > Whenever a regionserver connects to one particular ZK peer *first*, it
> > never goes online. The ZK log shows a successful connection
> > negotiating a timeout value, and the RS's log shows a successful ZK
> > connection, but then it just sits there.
> >
> > When a regionserver starts up and connects to one of the other two ZK
> > peers first, it connects to a second one successfully, then contacts
> > the master, and it comes up and all is happy.
> >
> > So the problem of regionservers not connecting to master only happens
> > when the RS tries one particular ZK node as its first ZK connection.
> > But the logs aren't helpful for diagnosing further than that.
> >
> > Additional thoughts?
> >
> >
> > On Wed, Oct 17, 2012 at 9:12 AM, Ramkrishna.S.Vasudevan
> > <[EMAIL PROTECTED]> wrote:
> >> Can you try starting any of the regionservers that are not connecting at
> >> all?  Maybe start 2 of them.
> >> Observe the master logs.  See whether they say
> >> 'Waiting for RegionServers to checkin'.
> >>
> >> Just to confirm: are your ZK IP and port correct throughout the cluster?
> >> If it is a multitenant cluster, then maybe the other regionservers are
> >> connecting to some other ZK cluster?
> >> Wild guess :)
> >>
> >> Regards
> >> Ram
> >>> -----Original Message-----
> >>> From: Dan Brodsky [mailto:[EMAIL PROTECTED]]
> >>> Sent: Wednesday, October 17, 2012 6:31 PM
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Regionservers not connecting to master
> >>>
> >>> Good morning,
> >>>
> >>> I have a 10 node Hadoop/Hbase cluster, plus a namenode VM, plus three
> >>> Zookeeper quorum peers (one on the namenode, one on a dedicated ZK
> >>> peer VM, and one on a third box). All 10 HDFS datanodes are also Hbase
> >>> regionservers.
> >>>
> >>> Several weeks ago, we had six HDFS datanodes go offline suddenly (with
> >>> no meaningful error messages), and since then, I have been unable to
> >>> get all 10 regionservers to connect to the Hbase master. I've tried
> >>> bringing the cluster down and rebooting all the boxes, but no joy. The
> >>> machines are all running, and hbase-regionserver appears to start
> >>> normally on each one.
> >>>
> >>> Right now, my master status page (http://namenode:60010) shows 3
> >>> regionservers online. There are also dozens of regions in transition
> >>> listed on the status page (in the PENDING_OPEN state), but each of
> >>> those is on one of the regionservers already online.
> >>>
> >>> The 7 other regionservers' log files show a successful connection to
> >>> one ZK peer, followed by a regular trail of these messages:
> >>>
> >>> 2012-10-17 12:36:08,394 DEBUG
> >>> org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=8.17
> >>> MB, free=987.67 MB, max=995.84 MB, blocks=0, accesses=0, hits=0,
> >>> hitRatio=0cachingAccesses=0, cachingHits=0,