-
Regionservers not connecting to master
Dan Brodsky 2012-10-17, 13:01
Good morning,

I have a 10-node Hadoop/HBase cluster, plus a namenode VM, plus three ZooKeeper quorum peers (one on the namenode, one on a dedicated ZK peer VM, and one on a third box). All 10 HDFS datanodes are also HBase regionservers.

Several weeks ago, six HDFS datanodes went offline suddenly (with no meaningful error messages), and since then I have been unable to get all 10 regionservers to connect to the HBase master. I've tried bringing the cluster down and rebooting all the boxes, but no joy. The machines are all running, and hbase-regionserver appears to start normally on each one.

Right now, my master status page (http://namenode:60010) shows 3 regionservers online. There are also dozens of regions in transition listed on the status page (in the PENDING_OPEN state), but each of those is on one of the regionservers already online.

The 7 other regionservers' log files show a successful connection to one ZK peer, followed by a regular trail of these messages:

2012-10-17 12:36:08,394 DEBUG org.apache.hadoop.hbase.io.hfile.LruBlockCache: LRU Stats: total=8.17 MB, free=987.67 MB, max=995.84 MB, blocks=0, accesses=0, hits=0, hitRatio=0cachingAccesses=0, cachingHits=0, cachingHitsRatio=0evictions=0, evicted=0, evictedPerRun=NaN

If I had to wager a guess, it seems like the 7 offline regionservers are not connecting to other ZK peers, but there isn't anything in the ZK logs to indicate why.

Thoughts?

Dan
-
RE: Regionservers not connecting to master
Ramkrishna.S.Vasudevan 2012-10-17, 13:12
Can you try starting a couple of the regionservers that are not connecting at all, and observe the master logs while they come up? See whether the master says 'Waiting for RegionServers to checkin'.

Also, can you confirm that the ZK IP and port are correct throughout the cluster? If this is a multi-tenant cluster, the other regionservers may be connecting to some other ZK cluster. Wild guess :)

Regards
Ram
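Editor's note: the check Ram suggests (same ZK address and port everywhere) could be scripted roughly as below. This is only a sketch under stated assumptions: it assumes you have already collected the text of each node's hbase-site.xml, and the helper names `zk_settings` and `odd_hosts_out` are made up for illustration. The two property names are the standard HBase ones for the quorum list and client port.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard HBase properties that tell a regionserver where ZooKeeper lives.
ZK_KEYS = ("hbase.zookeeper.quorum", "hbase.zookeeper.property.clientPort")


def zk_settings(hbase_site_xml):
    """Return the ZooKeeper-related properties found in hbase-site.xml text."""
    root = ET.fromstring(hbase_site_xml)
    settings = {}
    for prop in root.findall("property"):
        name = (prop.findtext("name") or "").strip()
        if name in ZK_KEYS:
            settings[name] = (prop.findtext("value") or "").strip()
    return settings


def odd_hosts_out(configs):
    """Given {host: hbase-site.xml text}, list the hosts whose ZK settings
    differ from the most common settings in the cluster (hypothetical helper)."""
    per_host = {
        host: tuple(sorted(zk_settings(xml_text).items()))
        for host, xml_text in configs.items()
    }
    if not per_host:
        return []
    majority, _count = Counter(per_host.values()).most_common(1)[0]
    return sorted(host for host, s in per_host.items() if s != majority)
```

An empty result only says the regionservers agree with each other; it says nothing about whether the ZK peers themselves are configured as one quorum.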
-
Re: Regionservers not connecting to master
Dan Brodsky 2012-10-17, 17:29
Ram,

Thanks for your suggestions.

The datanodes are all built using the same image, so I know they're all pointed to the same ZK nodes.

I monitored all three ZK logs, the master log, and the regionserver log for each RS I was trying to bring back online. I'm glad I have a big screen. :-) Here is what I found:

Whenever a regionserver connects to one particular ZK peer *first*, it never goes online. The ZK log shows a successful connection negotiating a timeout value, and the RS's log shows a successful ZK connection, but then it just sits there.

When a regionserver starts up and connects to one of the other two ZK peers first, it connects to a second one successfully, then contacts the master, and it comes up and all is happy.

So the problem of regionservers not connecting to master only happens when the RS tries one particular ZK node as its first ZK connection. But the logs aren't helpful for diagnosing further than that.

Additional thoughts?
-
Re: Regionservers not connecting to master
Dan Brodsky 2012-10-17, 17:35
Well, slight change: only 1 of the ZK peers happens to work. When an RS connects to either of the other 2, it doesn't go any further than that. The 1 ZK node that happens to work is the one that runs on the same VM as the master.

Sounds like it could be a network connectivity issue, so I'm going to investigate that a bit further, but other suggestions are welcome.
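Editor's note: when only some ZK peers "work", one quick thing to look at is the mode each peer reports. ZooKeeper answers the four-letter command `srvr` on its client port with a short status that includes a `Mode:` line (`leader`, `follower`, or `standalone`). The sketch below is an illustration, not part of the original thread: the function names are made up, port 2181 is assumed to be the client port, and on newer ZooKeeper releases four-letter commands may need to be enabled in the server configuration.

```python
import socket


def parse_mode(srvr_output):
    """Return the value of the 'Mode:' line in a 'srvr' response, or None."""
    for line in srvr_output.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def query_mode(host, port=2181, timeout=5.0):
    """Send 'srvr' to one ZooKeeper peer and return its reported mode.

    Raises OSError if the peer cannot be reached within the timeout.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return parse_mode(b"".join(chunks).decode("utf-8", errors="replace"))
```

Calling `query_mode` once per peer (for example with hypothetical hostnames `zk1`, `zk2`, `zk3`) should show one leader and two followers in a healthy three-node ensemble. Peers that all answer `standalone` are not acting as one quorum.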
-
RE: Regionservers not connecting to master
Ramkrishna.S.Vasudevan 2012-10-18, 04:25
Just check your /etc/hosts files. I have not worked with VMs, so I can't pin down the problem more precisely.

Regards
Ram
-
Re: Regionservers not connecting to master
Dan Brodsky 2012-11-02, 18:13
Ram,

I wanted to follow up with you since you helped me with your comment below.

It turns out that the ZK configuration files somehow got changed (reverted to their default values?), and I'm not sure who/when/how. The zoo.cfg files didn't have the list of quorum peers, and the myid files that tell each ZK peer its ordinal value had been deleted. So, effectively, I had three standalone ZK servers instead of one quorum.

Problem fixed, HBase is happy again.

Cheers,

Dan
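Editor's note: the failure Dan describes (no quorum peer list in zoo.cfg, no myid file) can be caught with a small check like the one below. It is a minimal sketch, not a full validator: the function names are made up, and it only inspects text you pass in. The `server.N=host:port:port` lines and the `myid` file in the data directory are the standard ZooKeeper way to define an ensemble. A config with no `server.N` lines starts the peer in standalone mode.

```python
import re

# Matches lines such as "server.1=zk1:2888:3888" in zoo.cfg.
SERVER_LINE = re.compile(r"^server\.(\d+)\s*=\s*(\S+)$")


def parse_servers(zoo_cfg):
    """Return {id: address} for every server.N entry in zoo.cfg text."""
    servers = {}
    for raw in zoo_cfg.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        match = SERVER_LINE.match(line)
        if match:
            servers[int(match.group(1))] = match.group(2)
    return servers


def quorum_problems(zoo_cfg, myid):
    """List problems that would keep this peer from joining a quorum.

    `myid` is the content of the myid file, or None if the file is missing.
    """
    problems = []
    servers = parse_servers(zoo_cfg)
    if not servers:
        problems.append("no server.N entries: peer will run standalone")
    if myid is None:
        problems.append("myid file is missing")
    else:
        try:
            my_number = int(myid.strip())
        except ValueError:
            problems.append("myid does not contain an integer")
        else:
            if servers and my_number not in servers:
                problems.append("myid %d has no matching server.N entry" % my_number)
    return problems
```

Running a check like this from cron or a monitoring job on each ZK host would flag a silent revert of the configuration before the regionservers notice it.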
-
Re: Regionservers not connecting to master
Kevin O'dell 2012-11-02, 18:22
Do you use Puppet?

--
Kevin O'Dell
Customer Operations Engineer, Cloudera
-
Re: Regionservers not connecting to master
Dan Brodsky 2012-11-02, 18:28
Nope. I'm honestly not sure how the files changed, but I will keep an eye on it.
-
Re: Regionservers not connecting to master
ramkrishna vasudevan 2012-11-03, 17:01
Nice... Thanks for the follow-up.

Regards
Ram