-
DataNodes fail to send heartbeat to HA-enabled NameNode
Takahiko Kawasaki 2012-10-30, 11:10
Hello,
I am having trouble with quorum-based HDFS HA on CDH 4.1.1.

The NameNode Web UI of Cloudera Manager reports the NameNode status.
It has a "Cluster Summary" section, and my cluster is summarized
there as shown below.

--- Cluster Summary ---
Configured Capacity   : 0 KB
DFS Used              : 0 KB
Non DFS Used          : 0 KB
DFS Remaining         : 0 KB
DFS Used%             : 100 %
DFS Remaining%        : 0 %
Block Pool Used       : 0 KB
Block Pool Used%      : 100 %
DataNodes usages      : Min %  Median %  Max %  stdev %
                        0 %    0 %       0 %    0 %
Live Nodes            : 0 (Decommissioned: 0)
Dead Nodes            : 5 (Decommissioned: 0)
Decommissioning Nodes : 0
--------------------

As you can see, all the DataNodes are regarded as dead.

I found that the DataNodes keep logging failures to send
heartbeats to the NameNode.

---- DataNode Log (host names were manually edited) ---
2012-10-30 19:28:16,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of 300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay: 0msec; heartBeatInterval=3000
2012-10-30 19:28:16,817 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-2063217961-192.168.62.231-1351263110470 (storage id DS-2090122187-192.168.62.233-50010-1338981658216) service to node02.example.com/192.168.62.232:8020
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
    at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
    at java.lang.Thread.run(Thread.java:662)
--------------------

So I guess that the DataNodes are failing to locate the name service
for some reason, but I have no clue how to solve the problem.
I confirmed that
/var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
of a DataNode contains

--- core-site.xml ---
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nameservice1</value>
</property>
--------------------

and that hdfs-site.xml contains

--- hdfs-site.xml ---
<property>
  <name>dfs.nameservices</name>
  <value>nameservice1</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.nameservice1</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.namenodes.nameservice1</name>
  <value>namenode38,namenode90</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
  <value>node01.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameservice1.namenode38</name>
  <value>node01.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.nameservice1.namenode38</name>
  <value>node01.example.com:50470</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
  <value>node02.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.nameservice1.namenode90</name>
  <value>node02.example.com:50070</value>
</property>
<property>
  <name>dfs.namenode.https-address.nameservice1.namenode90</name>
  <value>jbmnode02.jibemobile.jp:50470</value>
</property>
<property>
  <name>dfs.permissions.superusergroup</name>
  <value>supergroup</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
</property>
<property>
  <name>dfs.replication.max</name>
  <value>512</value>
</property>
--------------------

I then tried to create a file in HDFS, but in vain, as shown below.
--------------------
# vi /tmp/test.txt
# sudo -u hdfs hadoop fs -mkdir /takahiko
# sudo -u hdfs hadoop fs -ls /
Found 3 items
drwxr-xr-x   - hbase hbase               0 2012-10-30 15:12 /hbase
drwxr-xr-x   - hdfs  supergroup          0 2012-10-30 18:55 /takahiko
drwxrwxrwt   - hdfs  hdfs                0 2012-10-26 23:58 /tmp
# sudo -u hdfs hadoop fs -copyFromLocal /tmp/test.txt /takahiko/
12/10/30 20:07:05 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /takahiko/test.txt._COPYING_ could only be replicated to 0 nodes instead of minReplication (=1). There are 0 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget(BlockManager.java:1322)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:2170)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:471)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:297)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java:44080)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:453)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:898)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1693)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1689)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1332)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1687)
    at org.apache.hadoop.ipc.Client.call(Client.java:1160)
    at org.apache.hadoo
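
For readers who want to rule out a configuration problem first, the HA properties quoted above can be checked mechanically. The following is a minimal sketch in Python, not a Hadoop or Cloudera tool: the function names `load_properties` and `check_ha_config` are hypothetical, and the script assumes the properties are wrapped in the usual `<configuration>` root element. It only verifies that every NameNode ID listed under `dfs.ha.namenodes.<nameservice>` has an RPC address and that the addresses of one NameNode point at the same host. A host mismatch is flagged for review only and is not necessarily an error.

```python
import xml.etree.ElementTree as ET


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def _split_csv(value):
    """Split a comma-separated property value into trimmed, non-empty items."""
    return [item.strip() for item in value.split(",") if item.strip()]


def check_ha_config(props):
    """Return a list of human-readable findings about the HA-related properties."""
    problems = []
    nameservices = _split_csv(props.get("dfs.nameservices", ""))
    if not nameservices:
        problems.append("dfs.nameservices is not set")
    for ns in nameservices:
        nn_ids = _split_csv(props.get("dfs.ha.namenodes." + ns, ""))
        if len(nn_ids) < 2:
            problems.append(f"dfs.ha.namenodes.{ns} lists fewer than two NameNode IDs")
        for nn in nn_ids:
            hosts = set()
            for kind in ("rpc-address", "http-address", "https-address"):
                key = f"dfs.namenode.{kind}.{ns}.{nn}"
                value = props.get(key)
                if value is None:
                    # Only the RPC address is treated as mandatory in this sketch.
                    if kind == "rpc-address":
                        problems.append(f"{key} is missing")
                    continue
                # Keep the host part of "host:port".
                hosts.add(value.rsplit(":", 1)[0])
            if len(hosts) > 1:
                problems.append(
                    f"{ns}.{nn} addresses point at different hosts: {sorted(hosts)}"
                )
    return problems
```

To use it, read the file into a string and pass it through both functions, for example `check_ha_config(load_properties(open(path).read()))`, where `path` is wherever the DataNode's hdfs-site.xml lives on your system. An empty list means the sketch found nothing to flag; it says nothing about whether the DataNodes can actually register.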
-
Re: DataNodes fail to send heartbeat to HA-enabled NameNode
Steve Loughran 2012-10-30, 17:06
On 30 October 2012 11:10, Takahiko Kawasaki <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have trouble in quorum-based HDFS HA of CDH 4.1.1.
>
> 2012-10-30 19:28:16,817 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in BPOfferService for Block pool BP-2063217961-192.168.62.231-1351263110470 (storage id DS-2090122187-192.168.62.233-50010-1338981658216) service to node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>     at java.lang.Thread.run(Thread.java:662)
> --------------------

Looks like you're the first person to find an issue in some code that
is very, very fresh.

File a bug report on JIRA, and try to replicate it on the latest Apache
alpha release if you can.
-
Re: DataNodes fail to send heartbeat to HA-enabled NameNode
Harsh J 2012-10-30, 18:14
Moving to [EMAIL PROTECTED]
(https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user), as it
may be a CDH4-specific problem.

Could you please share your whole DN log (from startup until the
heartbeat errors)? I suspect it's a problem with DN registration, which
the log will help confirm.

Harsh J
-
Re: DataNodes fail to send heartbeat to HA-enabled NameNode
Todd Lipcon 2012-10-30, 18:23
Hi Takahiko,
Can you please provide the full datanode log up to the point where you
first see an NPE?

FWIW, this error has nothing to do with the new QuorumJournalManager
feature -- I've seen this bug once or twice over the last couple of
years but have never been able to reproduce it reliably.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
-
Re: DataNodes fail to send heartbeat to HA-enabled NameNode
Todd Lipcon 2012-10-30, 20:16
BTW, I forgot that I did file a ticket a while back on a related issue:
https://issues.apache.org/jira/browse/hdfs-2882

My assumption is that, higher up in the logs, you will find an
underlying issue that caused the NPEs later.

-Todd

--
Todd Lipcon
Software Engineer, Cloudera
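
The request made twice in this thread, for the DataNode log up to the first NPE, can be partly automated. What follows is a rough Python sketch under stated assumptions: the timestamp and level layout is taken from the log excerpt quoted earlier in the thread, the helper names `split_entries` and `problems_before_first_npe` are made up for illustration, and nothing here is part of Hadoop. It groups stack-trace lines with the log entry that owns them, then returns the WARN, ERROR and FATAL entries that appear before the first entry mentioning a NullPointerException.

```python
import re

# Matches the start of a log entry such as
# "2012-10-30 19:28:16,817 ERROR org.apache...": timestamp, then level.
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} (?P<level>[A-Z]+) "
)


def split_entries(lines):
    """Group raw log lines into entries; continuation lines (stack traces)
    are attached to the entry that precedes them."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if LOG_LINE.match(line) or not entries:
            entries.append([line])
        else:
            entries[-1].append(line)
    return entries


def problems_before_first_npe(lines):
    """Return WARN/ERROR/FATAL entries logged before the first entry that
    mentions a NullPointerException. If no such entry exists, every
    WARN/ERROR/FATAL entry in the log is returned."""
    suspicious = []
    for entry in split_entries(lines):
        if any("NullPointerException" in part for part in entry):
            return suspicious
        match = LOG_LINE.match(entry[0])
        if match and match.group("level") in ("WARN", "ERROR", "FATAL"):
            suspicious.append("\n".join(entry))
    return suspicious
```

Calling `problems_before_first_npe(open(log_path))` on a DataNode log, with `log_path` standing in for your own log location, gives a short list of candidates for the underlying issue. It is a filter for a human reader, not a diagnosis; the full log is still what the responders asked for.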