Re: DataNodes fail to send heartbeat to HA-enabled NameNode
Harsh J 2012-10-30, 18:14
Moving to [EMAIL PROTECTED]
(https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user), as it
may be a CDH4-specific problem.

Could you share your whole DN log (from startup until the heartbeat
errors), please? I suspect it's a problem with DN registration, which the
log will help confirm.

On Tue, Oct 30, 2012 at 4:40 PM, Takahiko Kawasaki <[EMAIL PROTECTED]> wrote:
> Hello,
>
> I have trouble in quorum-based HDFS HA of CDH 4.1.1.
>
> NameNode Web UI of Cloudera Manager reports NameNode status.
> It has a "Cluster Summary" section and my cluster is summarized
> there like below.
>
> --- Cluster Summary ---
> Configured Capacity : 0 KB
> DFS Used : 0 KB
> Non DFS Used : 0 KB
> DFS Remaining : 0 KB
> DFS Used% : 100 %
> DFS Remaining% : 0 %
> Block Pool Used : 0 KB
> Block Pool Used% : 100 %
> DataNodes usages : Min % Median % Max % stdev %
> 0 % 0 % 0 % 0 %
> Live Nodes : 0 (Decommissioned: 0)
> Dead Nodes : 5 (Decommissioned: 0)
> Decommissioning Nodes : 0
> --------------------
>
> As you can see, all the DataNodes are regarded as dead.
>
> I found DataNodes continued to emit logs about failure to
> send heartbeat to NameNode.
>
> ---- DataNode Log (host names were manually edited) ---
> 2012-10-30 19:28:16,817 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode
> node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of
> 300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay:
> 0msec; heartBeatInterval=3000
> 2012-10-30 19:28:16,817 ERROR
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> BPOfferService for Block pool
> BP-2063217961-192.168.62.231-1351263110470 (storage id
> DS-2090122187-192.168.62.233-50010-1338981658216) service to
> node02.example.com/192.168.62.232:8020
> java.lang.NullPointerException
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.sendHeartBeat(BPServiceActor.java:435)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:521)
>     at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:674)
>     at java.lang.Thread.run(Thread.java:662)
> --------------------
>
> So, I guess that DataNodes are failing to locate the name service
> for some reason, but I don't have any clue how to solve the problem.
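
For readers skimming the INFO line in the quoted log, here is a minimal sketch that pulls out the target NameNode and converts the reported intervals into friendlier units. The function name `parse_intervals` and the result keys are made up for illustration. The log line gives no unit for heartBeatInterval, so the sketch assumes it is milliseconds like the other values.

```python
import re

# INFO line copied from the quoted DataNode log, re-joined onto one line.
LOG_LINE = (
    "2012-10-30 19:28:16,817 INFO "
    "org.apache.hadoop.hdfs.server.datanode.DataNode: For namenode "
    "node02.example.com/192.168.62.232:8020 using DELETEREPORT_INTERVAL of "
    "300000 msec BLOCKREPORT_INTERVAL of 21600000msec Initial delay: "
    "0msec; heartBeatInterval=3000"
)


def parse_intervals(line):
    """Extract the NameNode address and the intervals from one INFO line.

    Hypothetical helper, not part of Hadoop. heartBeatInterval is ASSUMED
    to be in milliseconds; the log line itself does not say.
    """
    def grab(pattern):
        match = re.search(pattern, line)
        if match is None:
            raise ValueError(f"pattern not found: {pattern}")
        return match.group(1)

    return {
        "namenode": grab(r"For namenode (\S+)"),
        "delete_report_s": int(grab(r"DELETEREPORT_INTERVAL of (\d+)\s*msec")) / 1000,
        "block_report_h": int(grab(r"BLOCKREPORT_INTERVAL of (\d+)\s*msec")) / 3_600_000,
        "initial_delay_s": int(grab(r"Initial delay: (\d+)\s*msec")) / 1000,
        "heartbeat_s": int(grab(r"heartBeatInterval=(\d+)")) / 1000,
    }


if __name__ == "__main__":
    for key, value in parse_intervals(LOG_LINE).items():
        print(f"{key}: {value}")
```

Under that assumption the line reads as a 5-minute delete report interval, a 6-hour block report interval, and a 3-second heartbeat, all directed at node02.example.com.
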
>
> I confirmed that
> /var/run/cloudera-scm-agent/process/???-hdfs-DATANODE/core-site.xml
> of a DataNode contains
>
> --- core-site.xml ---
> <property>
>   <name>fs.defaultFS</name>
>   <value>hdfs://nameservice1</value>
> </property>
> --------------------
>
> and hdfs-site.xml contains
>
> --- hdfs-site.xml ---
> <property>
>   <name>dfs.nameservices</name>
>   <value>nameservice1</value>
> </property>
> <property>
>   <name>dfs.client.failover.proxy.provider.nameservice1</name>
>   <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
> </property>
> <property>
>   <name>dfs.ha.namenodes.nameservice1</name>
>   <value>namenode38,namenode90</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
>   <value>node01.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.http-address.nameservice1.namenode38</name>
>   <value>node01.example.com:50070</value>
> </property>
> <property>
>   <name>dfs.namenode.https-address.nameservice1.namenode38</name>
>   <value>node01.example.com:50470</value>
> </property>
> <property>
>   <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
>   <value>node02.example.com:8020</value>
> </property>
> <property>
>   <name>dfs.namenode.http-address.nameservice1.namenode90</name>
>   <value>node02.example.com:50070</value>
> </property>
> <property>
>   <name>dfs.namenode.https-address.nameservice1.namenode90</name>

Harsh J
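
As a quick sanity check on the quoted hdfs-site.xml, the sketch below shows how the HA keys chain together: dfs.nameservices lists the nameservice IDs, dfs.ha.namenodes.<nameservice> lists the NameNode IDs, and dfs.namenode.rpc-address.<nameservice>.<namenode> gives each RPC endpoint. The helper names `load_properties` and `namenode_rpc_addresses` are hypothetical, and the embedded XML is only a subset of the quoted properties, wrapped in a `<configuration>` root so that it parses. This is not how the DataNode itself reads its configuration, only a way to spot a missing or misspelled key.

```python
import xml.etree.ElementTree as ET

# Subset of the quoted hdfs-site.xml, wrapped in a <configuration> root
# (the root element is added here so the fragment is well-formed XML).
HDFS_SITE = """<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>nameservice1</value>
  </property>
  <property>
    <name>dfs.ha.namenodes.nameservice1</name>
    <value>namenode38,namenode90</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode38</name>
    <value>node01.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.nameservice1.namenode90</name>
    <value>node02.example.com:8020</value>
  </property>
</configuration>"""


def load_properties(xml_text):
    """Flatten <property><name>/<value> pairs into a dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {p.findtext("name"): p.findtext("value") for p in root.iter("property")}


def namenode_rpc_addresses(props):
    """Map (nameservice, namenode ID) to its RPC address, or None if the key is absent."""
    result = {}
    for ns in props["dfs.nameservices"].split(","):
        ns = ns.strip()
        for nn in props[f"dfs.ha.namenodes.{ns}"].split(","):
            nn = nn.strip()
            result[(ns, nn)] = props.get(f"dfs.namenode.rpc-address.{ns}.{nn}")
    return result


if __name__ == "__main__":
    for (ns, nn), addr in namenode_rpc_addresses(load_properties(HDFS_SITE)).items():
        print(f"{ns}/{nn} -> {addr}")
```

An entry that comes back as None would point at a NameNode ID listed in dfs.ha.namenodes.<nameservice> without a matching rpc-address key.
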