|
|
-
Namenode hangs on startup
Jianhui Zhang 2012-07-02, 09:04
Hi, Apache Hadoop 0.20.205. I'm trying to restart NN and it always hangs at the very beginning. The only logs I've got are: /************************************************************ STARTUP_MSG: Starting NameNode STARTUP_MSG: host = host/ip STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.205.0 STARTUP_MSG: build https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205-r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 ************************************************************/ 2012-07-02 01:33:01,281 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2012-07-02 01:33:01,290 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 2012-07-02 01:33:01,292 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2012-07-02 01:33:01,292 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics system started 2012-07-02 01:33:01,434 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered. 2012-07-02 01:33:01,436 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists! 2012-07-02 01:33:01,441 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source jvm registered. 2012-07-02 01:33:01,441 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source NameNode registered. 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type = 64-bit 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max memory = 314.0275 MB 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: capacity = 2^25 = 33554432 entries 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: recommended=33554432, actual=33554432 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2012-07-02 01:33:01,546 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2012-07-02 01:33:01,550 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: dfs.block.invalidate.limit=100 2012-07-02 01:33:01,550 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), accessTokenLifetime=0 min(s) 2012-07-02 01:33:01,787 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStateMBean and NameNodeMXBean 2012-07-02 01:33:01,802 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names occuring more than 10 times 2012-07-02 01:33:01,811 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032 2012-07-02 01:33:02,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 0 2012-07-02 01:33:02,406 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2553316 loaded in 0 seconds. 2012-07-02 01:33:02,410 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 seconds. =================================== It hangs thereafter.... I wonder if anybody has seen this before? Some background: I shut down DFS and MR while there were still jobs running. Some MR jobs were hanging, so I manually killed the children JVMs after the shutdown. Not sure how such actions would affect NN startup. Any help would be appreciated. Thanks, James
+
Jianhui Zhang 2012-07-02, 09:04
-
Re: Namenode hangs on startup
Harsh J 2012-07-02, 10:21
Jianhui, Can you pastebin.com the output of your "jstack <NN PID>" command after its hung, and pass us the paste link please? It looks to me like it may have just been merging/saving the image, and that may be slow but it depends on how long did you have to wait around to see NN resume and begin properly. On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: > Hi, > > Apache Hadoop 0.20.205. > > I'm trying to restart NN and it always hangs at the very beginning. > The only logs I've got are: > > /************************************************************ > STARTUP_MSG: Starting NameNode > STARTUP_MSG: host = host/ip > STARTUP_MSG: args = [] > STARTUP_MSG: version = 0.20.205.0 > STARTUP_MSG: build > https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 > ************************************************************/ > 2012-07-02 01:33:01,281 INFO > org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from > hadoop-metrics2.properties > 2012-07-02 01:33:01,290 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > MetricsSystem,sub=Stats registered. > 2012-07-02 01:33:01,292 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot > period at 10 second(s). > 2012-07-02 01:33:01,292 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics > system started > 2012-07-02 01:33:01,434 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > ugi registered. > 2012-07-02 01:33:01,436 WARN > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi > already exists! > 2012-07-02 01:33:01,441 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > jvm registered. > 2012-07-02 01:33:01,441 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > NameNode registered. > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type > = 64-bit > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max > memory = 314.0275 MB > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: > capacity = 2^25 = 33554432 entries > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: > recommended=33554432, actual=33554432 > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > supergroup=supergroup > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isPermissionEnabled=true > 2012-07-02 01:33:01,550 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > dfs.block.invalidate.limit=100 > 2012-07-02 01:33:01,550 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), > accessTokenLifetime=0 min(s) > 2012-07-02 01:33:01,787 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered > FSNamesystemStateMBean and NameNodeMXBean > 2012-07-02 01:33:01,802 INFO > org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names > occuring more than 10 times > 2012-07-02 01:33:01,811 INFO > org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032 > 2012-07-02 01:33:02,406 INFO > org.apache.hadoop.hdfs.server.common.Storage: Number of files under > construction = 0 > 2012-07-02 01:33:02,406 INFO > org.apache.hadoop.hdfs.server.common.Storage: Image file of size > 2553316 loaded in 0 seconds. > 2012-07-02 01:33:02,410 INFO > org.apache.hadoop.hdfs.server.common.Storage: Edits file > /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 > seconds. > > ===================================> > It hangs thereafter.... I wonder if anybody has seen this before? > > Some background: I shut down DFS and MR while there were still jobs > running. Some MR jobs were hanging, so I manually killed the children Harsh J
+
Harsh J 2012-07-02, 10:21
-
Re: Namenode hangs on startup
Jianhui Zhang 2012-07-02, 19:04
Hi folks, Thanks for helping, especially at such earlier hours. After leaving it overnight, during which period nothing happened in the log, I restarted this morning. This time, it passed the previously stuck point, and reached all the way to "IPC Server handler.. starting", in Safe Mode. So it looks more promising now. But it's in a state of: "The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically." Does that mean the NN is waiting for DNs's communications/updates? How can I tell whether it's stuck or just slow? The NN log is at: http://pastebin.com/5fvRfRSDThe jstack output is at: http://pastebin.com/RnDXWrtcThe configurations are really basic: core-site.xml: <configuration> <property> <name>fs.default.name</name> <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value> <final>true</final> </property> <property> <name>io.file.buffer.size</name> <value>65536</value> </property> </configuration> It's the same for all nodes. Again, appreciate your help. Thanks, James On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <[EMAIL PROTECTED]> wrote: > Jianhui, > > Can you pastebin.com the output of your "jstack <NN PID>" command > after its hung, and pass us the paste link please? It looks to me like > it may have just been merging/saving the image, and that may be slow > but it depends on how long did you have to wait around to see NN > resume and begin properly. > > On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: >> Hi, >> >> Apache Hadoop 0.20.205. >> >> I'm trying to restart NN and it always hangs at the very beginning. >> The only logs I've got are: >> >> /************************************************************ >> STARTUP_MSG: Starting NameNode >> STARTUP_MSG: host = host/ip >> STARTUP_MSG: args = [] >> STARTUP_MSG: version = 0.20.205.0 >> STARTUP_MSG: build >> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >> ************************************************************/ >> 2012-07-02 01:33:01,281 INFO >> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >> hadoop-metrics2.properties >> 2012-07-02 01:33:01,290 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> MetricsSystem,sub=Stats registered. >> 2012-07-02 01:33:01,292 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >> period at 10 second(s). >> 2012-07-02 01:33:01,292 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics >> system started >> 2012-07-02 01:33:01,434 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> ugi registered. >> 2012-07-02 01:33:01,436 WARN >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >> already exists! >> 2012-07-02 01:33:01,441 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> jvm registered. >> 2012-07-02 01:33:01,441 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> NameNode registered. >> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type >> = 64-bit >> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max >> memory = 314.0275 MB >> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >> capacity = 2^25 = 33554432 entries >> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >> recommended=33554432, actual=33554432 >> 2012-07-02 01:33:01,546 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner >> 2012-07-02 01:33:01,546 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >> supergroup=supergroup >> 2012-07-02 01:33:01,546 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >> isPermissionEnabled=true >> 2012-07-02 01:33:01,550 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
+
Jianhui Zhang 2012-07-02, 19:04
-
Re: Namenode hangs on startup
Harsh J 2012-07-02, 19:07
Juanhui, It is merely waiting for the DNs to start, and to report its blocks in. This does not take long once the DNs are up and running. Do you see any Live Nodes yet? On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: > Hi folks, > > Thanks for helping, especially at such earlier hours. > > After leaving it overnight, during which period nothing happened in > the log, I restarted this morning. This time, it passed the previously > stuck point, and reached all the way to "IPC Server handler.. > starting", in Safe Mode. So it looks more promising now. > > But it's in a state of: > > "The ratio of reported blocks 0.0000 has not reached the threshold > 0.9990. Safe mode will be turned off automatically." > > Does that mean the NN is waiting for DNs's communications/updates? How > can I tell whether it's stuck or just slow? > > The NN log is at: http://pastebin.com/5fvRfRSD> > The jstack output is at: http://pastebin.com/RnDXWrtc> > The configurations are really basic: > > core-site.xml: > > <configuration> > <property> > <name>fs.default.name</name> > <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value> > <final>true</final> > </property> > <property> > <name>io.file.buffer.size</name> > <value>65536</value> > </property> > </configuration> > > It's the same for all nodes. > > Again, appreciate your help. > > Thanks, > James > > On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <[EMAIL PROTECTED]> wrote: >> Jianhui, >> >> Can you pastebin.com the output of your "jstack <NN PID>" command >> after its hung, and pass us the paste link please? It looks to me like >> it may have just been merging/saving the image, and that may be slow >> but it depends on how long did you have to wait around to see NN >> resume and begin properly. >> >> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: >>> Hi, >>> >>> Apache Hadoop 0.20.205. >>> >>> I'm trying to restart NN and it always hangs at the very beginning. >>> The only logs I've got are: >>> >>> /************************************************************ >>> STARTUP_MSG: Starting NameNode >>> STARTUP_MSG: host = host/ip >>> STARTUP_MSG: args = [] >>> STARTUP_MSG: version = 0.20.205.0 >>> STARTUP_MSG: build >>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >>> ************************************************************/ >>> 2012-07-02 01:33:01,281 INFO >>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >>> hadoop-metrics2.properties >>> 2012-07-02 01:33:01,290 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> MetricsSystem,sub=Stats registered. >>> 2012-07-02 01:33:01,292 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >>> period at 10 second(s). >>> 2012-07-02 01:33:01,292 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics >>> system started >>> 2012-07-02 01:33:01,434 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> ugi registered. >>> 2012-07-02 01:33:01,436 WARN >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >>> already exists! >>> 2012-07-02 01:33:01,441 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> jvm registered. >>> 2012-07-02 01:33:01,441 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> NameNode registered. >>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type >>> = 64-bit >>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max >>> memory = 314.0275 MB >>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>> capacity = 2^25 = 33554432 entries >>> 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: >>> recommended=33554432, actual=33554432 >>> 2012-07-02 01:33:01,546 INFO Harsh J
+
Harsh J 2012-07-02, 19:07
-
Re: Namenode hangs on startup
Jianhui Zhang 2012-07-02, 19:32
Hi Harsh, The web portal of the NN shows 0 nodes. Looking into each node's log, all nodes but one have been staying at: /************************************************************ STARTUP_MSG: Starting DataNode STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.205.0 STARTUP_MSG: build https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205-r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 ************************************************************/ 2012-07-02 11:00:38,891 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties 2012-07-02 11:00:38,906 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered. 2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s). 2012-07-02 11:00:38,908 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started 2012-07-02 11:00:39,091 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered. 2012-07-02 11:00:39,094 WARN org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi already exists! ============================ The one node passed that and reached: 2012-07-02 12:24:58,450 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_-2932980158620691384_129517 Does that mean progress? It's been 1.5 hour since the start. And the file system side is: 74515 files and directories, 28439 blocks 102954 total Thanks for helping. James On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <[EMAIL PROTECTED]> wrote: > Juanhui, > > It is merely waiting for the DNs to start, and to report its blocks > in. This does not take long once the DNs are up and running. Do you > see any Live Nodes yet? > > On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: >> Hi folks, >> >> Thanks for helping, especially at such earlier hours. >> >> After leaving it overnight, during which period nothing happened in >> the log, I restarted this morning. This time, it passed the previously >> stuck point, and reached all the way to "IPC Server handler.. >> starting", in Safe Mode. So it looks more promising now. >> >> But it's in a state of: >> >> "The ratio of reported blocks 0.0000 has not reached the threshold >> 0.9990. Safe mode will be turned off automatically." >> >> Does that mean the NN is waiting for DNs's communications/updates? How >> can I tell whether it's stuck or just slow? >> >> The NN log is at: http://pastebin.com/5fvRfRSD>> >> The jstack output is at: http://pastebin.com/RnDXWrtc>> >> The configurations are really basic: >> >> core-site.xml: >> >> <configuration> >> <property> >> <name>fs.default.name</name> >> <value>hdfs://pipeline-hdnn01-virtual.x.y.z:8020</value> >> <final>true</final> >> </property> >> <property> >> <name>io.file.buffer.size</name> >> <value>65536</value> >> </property> >> </configuration> >> >> It's the same for all nodes. >> >> Again, appreciate your help. >> >> Thanks, >> James >> >> On Mon, Jul 2, 2012 at 3:21 AM, Harsh J <[EMAIL PROTECTED]> wrote: >>> Jianhui, >>> >>> Can you pastebin.com the output of your "jstack <NN PID>" command >>> after its hung, and pass us the paste link please? It looks to me like >>> it may have just been merging/saving the image, and that may be slow >>> but it depends on how long did you have to wait around to see NN >>> resume and begin properly. >>> >>> On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: >>>> Hi, >>>> >>>> Apache Hadoop 0.20.205. >>>> >>>> I'm trying to restart NN and it always hangs at the very beginning. >>>> The only logs I've got are: >>>> >>>> /************************************************************ >>>> STARTUP_MSG: Starting NameNode >>>> STARTUP_MSG: host = host/ip >
+
Jianhui Zhang 2012-07-02, 19:32
-
Re: Namenode hangs on startup
David Rosenstrauch 2012-07-02, 19:56
A couple of thoughts: 1) It shouldn't take more than a couple of minutes for the data nodes to re-register after the name node comes back up. If the data nodes aren't registering, you might try restarting the data node daemons on each machine. They should all register soon after. (If not, then you have other problems involved.) 2) Re: the original symptom you reported about the namenode "hanging" at startup: you should read up on the "secondary namenode" component in Hadoop (you might start at http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode and http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/). If you don't have a secondary namenode running in your cluster, that can have a significantly negative impact on namenode startup time. (The namenode needs to replay the log at startup - which can take a long time if the log has gotten large.) HTH, DR On 07/02/2012 03:32 PM, Jianhui Zhang wrote: > Hi Harsh, > > The web portal of the NN shows 0 nodes. > > Looking into each node's log, all nodes but one have been staying at: > > /************************************************************ > STARTUP_MSG: Starting DataNode > STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 > STARTUP_MSG: args = [] > STARTUP_MSG: version = 0.20.205.0 > STARTUP_MSG: build > https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 > ************************************************************/ > 2012-07-02 11:00:38,891 INFO > org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from > hadoop-metrics2.properties > 2012-07-02 11:00:38,906 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > MetricsSystem,sub=Stats registered. > 2012-07-02 11:00:38,908 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot > period at 10 second(s). > 2012-07-02 11:00:38,908 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics > system started > 2012-07-02 11:00:39,091 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > ugi registered. > 2012-07-02 11:00:39,094 WARN > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi > already exists! > ============================> > The one node passed that and reached: > > 2012-07-02 12:24:58,450 INFO > org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification > succeeded for blk_-2932980158620691384_129517 > > Does that mean progress? It's been 1.5 hour since the start. And the > file system side is: 74515 files and directories, 28439 blocks > 102954 total > > Thanks for helping. > James > > On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <[EMAIL PROTECTED]> wrote: >> Juanhui, >> >> It is merely waiting for the DNs to start, and to report its blocks >> in. This does not take long once the DNs are up and running. Do you >> see any Live Nodes yet? >> >> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: >>> Hi folks, >>> >>> Thanks for helping, especially at such earlier hours. >>> >>> After leaving it overnight, during which period nothing happened in >>> the log, I restarted this morning. This time, it passed the previously >>> stuck point, and reached all the way to "IPC Server handler.. >>> starting", in Safe Mode. So it looks more promising now. >>> >>> But it's in a state of: >>> >>> "The ratio of reported blocks 0.0000 has not reached the threshold >>> 0.9990. Safe mode will be turned off automatically." >>> >>> Does that mean the NN is waiting for DNs's communications/updates? How >>> can I tell whether it's stuck or just slow? >>> >>> The NN log is at: http://pastebin.com/5fvRfRSD>>> >>> The jstack output is at: http://pastebin.com/RnDXWrtc>>> >>> The configurations are really basic: >>> >>> core-site.xml: >>> >>> <configuration> >>> <property> >>> <name>fs.default.name</name>
+
David Rosenstrauch 2012-07-02, 19:56
-
Re: Namenode hangs on startup
Jianhui Zhang 2012-07-02, 22:30
Thanks for helping. I do have a secondary namenode. And I'm starting one datanode at a time now. Previously I was using start-dfs.sh to start NN and slaves. I observed pretty strange behaviors: 1. If use start-dfs.sh, almost none of the DN joined. All seemed stuck. 2. If I start one DN at a time, SOME joined instantly, but some looked hanging... The funny thing is: if I start over again, the behavior changed for each node.... Any idea? Thanks, James On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote: > A couple of thoughts: > > 1) It shouldn't take more than a couple of minutes for the data nodes to > re-register after the name node comes back up. If the data nodes aren't > registering, you might try restarting the data node daemons on each machine. > They should all register soon after. (If not, then you have other problems > involved.) > > 2) Re: the original symptom you reported about the namenode "hanging" at > startup: you should read up on the "secondary namenode" component in Hadoop > (you might start at > http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode> and > http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/). > If you don't have a secondary namenode running in your cluster, that can > have a significantly negative impact on namenode startup time. (The > namenode needs to replay the log at startup - which can take a long time if > the log has gotten large.) > > HTH, > > DR > > > On 07/02/2012 03:32 PM, Jianhui Zhang wrote: >> >> Hi Harsh, >> >> The web portal of the NN shows 0 nodes. >> >> Looking into each node's log, all nodes but one have been staying at: >> >> /************************************************************ >> STARTUP_MSG: Starting DataNode >> STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 >> STARTUP_MSG: args = [] >> STARTUP_MSG: version = 0.20.205.0 >> STARTUP_MSG: build >> >> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >> ************************************************************/ >> 2012-07-02 11:00:38,891 INFO >> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >> hadoop-metrics2.properties >> 2012-07-02 11:00:38,906 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> MetricsSystem,sub=Stats registered. >> 2012-07-02 11:00:38,908 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >> period at 10 second(s). >> 2012-07-02 11:00:38,908 INFO >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics >> system started >> 2012-07-02 11:00:39,091 INFO >> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >> ugi registered. >> 2012-07-02 11:00:39,094 WARN >> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >> already exists! >> ============================>> >> The one node passed that and reached: >> >> 2012-07-02 12:24:58,450 INFO >> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification >> succeeded for blk_-2932980158620691384_129517 >> >> Does that mean progress? It's been 1.5 hour since the start. And the >> file system side is: 74515 files and directories, 28439 blocks >> 102954 total >> >> Thanks for helping. >> James >> >> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <[EMAIL PROTECTED]> wrote: >>> >>> Juanhui, >>> >>> It is merely waiting for the DNs to start, and to report its blocks >>> in. This does not take long once the DNs are up and running. Do you >>> see any Live Nodes yet? >>> >>> On Tue, Jul 3, 2012 at 12:34 AM, Jianhui Zhang <[EMAIL PROTECTED]> >>> wrote: >>>> >>>> Hi folks, >>>> >>>> Thanks for helping, especially at such earlier hours. >>>> >>>> After leaving it overnight, during which period nothing happened in >>>> the log, I restarted this morning. This time, it passed the previously >>>> stuck point, and reached all the way to "IPC Server handler..
+
Jianhui Zhang 2012-07-02, 22:30
-
Re: Namenode hangs on startup
Harsh J 2012-07-03, 02:21
James, You may be suffering from this old issue that deals with system entropy? http://search-hadoop.com/m/7Giae6vLWR1You'll know for sure if you jstack your DNs and run it by the above thread's comments. On Tue, Jul 3, 2012 at 4:00 AM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: > Thanks for helping. > > I do have a secondary namenode. > > And I'm starting one datanode at a time now. Previously I was using > start-dfs.sh to start NN and slaves. I observed pretty strange > behaviors: > > 1. If use start-dfs.sh, almost none of the DN joined. All seemed stuck. > > 2. If I start one DN at a time, SOME joined instantly, but some looked > hanging... The funny thing is: if I start over again, the behavior > changed for each node.... > > Any idea? > > Thanks, > James > > > On Mon, Jul 2, 2012 at 12:56 PM, David Rosenstrauch <[EMAIL PROTECTED]> wrote: >> A couple of thoughts: >> >> 1) It shouldn't take more than a couple of minutes for the data nodes to >> re-register after the name node comes back up. If the data nodes aren't >> registering, you might try restarting the data node daemons on each machine. >> They should all register soon after. (If not, then you have other problems >> involved.) >> >> 2) Re: the original symptom you reported about the namenode "hanging" at >> startup: you should read up on the "secondary namenode" component in Hadoop >> (you might start at >> http://hadoop.apache.org/common/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode>> and >> http://www.cloudera.com/blog/2009/02/multi-host-secondarynamenode-configuration/). >> If you don't have a secondary namenode running in your cluster, that can >> have a significantly negative impact on namenode startup time. (The >> namenode needs to replay the log at startup - which can take a long time if >> the log has gotten large.) >> >> HTH, >> >> DR >> >> >> On 07/02/2012 03:32 PM, Jianhui Zhang wrote: >>> >>> Hi Harsh, >>> >>> The web portal of the NN shows 0 nodes. >>> >>> Looking into each node's log, all nodes but one have been staying at: >>> >>> /************************************************************ >>> STARTUP_MSG: Starting DataNode >>> STARTUP_MSG: host = pipeline09.x.y.z/10.2.20.109 >>> STARTUP_MSG: args = [] >>> STARTUP_MSG: version = 0.20.205.0 >>> STARTUP_MSG: build >>> >>> https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205>>> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 >>> ************************************************************/ >>> 2012-07-02 11:00:38,891 INFO >>> org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from >>> hadoop-metrics2.properties >>> 2012-07-02 11:00:38,906 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> MetricsSystem,sub=Stats registered. >>> 2012-07-02 11:00:38,908 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot >>> period at 10 second(s). >>> 2012-07-02 11:00:38,908 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics >>> system started >>> 2012-07-02 11:00:39,091 INFO >>> org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source >>> ugi registered. >>> 2012-07-02 11:00:39,094 WARN >>> org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi >>> already exists! >>> ============================>>> >>> The one node passed that and reached: >>> >>> 2012-07-02 12:24:58,450 INFO >>> org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification >>> succeeded for blk_-2932980158620691384_129517 >>> >>> Does that mean progress? It's been 1.5 hour since the start. And the >>> file system side is: 74515 files and directories, 28439 blocks >>> 102954 total >>> >>> Thanks for helping. >>> James >>> >>> On Mon, Jul 2, 2012 at 12:07 PM, Harsh J <[EMAIL PROTECTED]> wrote: >>>> >>>> Juanhui, >>>> >>>> It is merely waiting for the DNs to start, and to report its blocks >>>> in. This does not take long once the DNs are up and running. Do you Harsh J
+
Harsh J 2012-07-03, 02:21
-
Re: Namenode hangs on startup
Mohammad Tariq 2012-07-02, 09:32
Hello Jianhui, Could you please post your configurations, if possible?? Regards, Mohammad Tariq On Mon, Jul 2, 2012 at 2:34 PM, Jianhui Zhang <[EMAIL PROTECTED]> wrote: > Hi, > > Apache Hadoop 0.20.205. > > I'm trying to restart NN and it always hangs at the very beginning. > The only logs I've got are: > > /************************************************************ > STARTUP_MSG: Starting NameNode > STARTUP_MSG: host = host/ip > STARTUP_MSG: args = [] > STARTUP_MSG: version = 0.20.205.0 > STARTUP_MSG: build > https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security-205> -r 1179940; compiled by 'hortonfo' on Fri Oct 7 06:20:32 UTC 2011 > ************************************************************/ > 2012-07-02 01:33:01,281 INFO > org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from > hadoop-metrics2.properties > 2012-07-02 01:33:01,290 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > MetricsSystem,sub=Stats registered. > 2012-07-02 01:33:01,292 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot > period at 10 second(s). > 2012-07-02 01:33:01,292 INFO > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NameNode metrics > system started > 2012-07-02 01:33:01,434 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > ugi registered. > 2012-07-02 01:33:01,436 WARN > org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Source name ugi > already exists! > 2012-07-02 01:33:01,441 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > jvm registered. > 2012-07-02 01:33:01,441 INFO > org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source > NameNode registered. > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: VM type > = 64-bit > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: 2% max > memory = 314.0275 MB > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: > capacity = 2^25 = 33554432 entries > 2012-07-02 01:33:01,463 INFO org.apache.hadoop.hdfs.util.GSet: > recommended=33554432, actual=33554432 > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=owner > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > supergroup=supergroup > 2012-07-02 01:33:01,546 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isPermissionEnabled=true > 2012-07-02 01:33:01,550 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > dfs.block.invalidate.limit=100 > 2012-07-02 01:33:01,550 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isAccessTokenEnabled=false accessKeyUpdateInterval=0 min(s), > accessTokenLifetime=0 min(s) > 2012-07-02 01:33:01,787 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered > FSNamesystemStateMBean and NameNodeMXBean > 2012-07-02 01:33:01,802 INFO > org.apache.hadoop.hdfs.server.namenode.NameNode: Caching file names > occuring more than 10 times > 2012-07-02 01:33:01,811 INFO > org.apache.hadoop.hdfs.server.common.Storage: Number of files = 17032 > 2012-07-02 01:33:02,406 INFO > org.apache.hadoop.hdfs.server.common.Storage: Number of files under > construction = 0 > 2012-07-02 01:33:02,406 INFO > org.apache.hadoop.hdfs.server.common.Storage: Image file of size > 2553316 loaded in 0 seconds. > 2012-07-02 01:33:02,410 INFO > org.apache.hadoop.hdfs.server.common.Storage: Edits file > /apr/hdfs/name/current/edits of size 498 edits # 7 loaded in 0 > seconds. > > ===================================> > It hangs thereafter.... I wonder if anybody has seen this before? > > Some background: I shut down DFS and MR while there were still jobs > running. Some MR jobs were hanging, so I manually killed the children > JVMs after the shutdown. Not sure how such actions would affect NN > startup. > > Any help would be appreciated. > > Thanks, > James
+
Mohammad Tariq 2012-07-02, 09:32
|
|