|
|
-
desperate question about NameNode startup sequence
Meng Mao 2011-12-17, 07:00
Our CDH2 production grid just crashed with some sort of master node failure. When I went in there, JobTracker was missing and NameNode was up. Trying to ls on HDFS met with no connection.
We decided to go for a restart. This is in the namenode log right now:
2011-12-17 01:37:35,568 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,612 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup 2011-12-17 01:37:35,613 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true 2011-12-17 01:37:35,620 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext 2011-12-17 01:37:35,621 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean 2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046 2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1 2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds. 2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # 43 loaded in 0 seconds. What's coming up in the startup sequence? We have a ton of data on there. Is there any way to estimate startup time?
-
Re: desperate question about NameNode startup sequence
Meng Mao 2011-12-17, 07:53
All of the worker nodes datanodes' logs haven't logged anything after the initial startup announcement: STARTUP_MSG: host = prod1-worker075/10.2.19.75 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.1+169.56 STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; compiled by 'root' on Tue Feb 9 13:40:08 EST 2010 ************************************************************/
On Sat, Dec 17, 2011 at 2:00 AM, Meng Mao <[EMAIL PROTECTED]> wrote:
> Our CDH2 production grid just crashed with some sort of master node > failure. > When I went in there, JobTracker was missing and NameNode was up. > Trying to ls on HDFS met with no connection. > > We decided to go for a restart. This is in the namenode log right now: > > 2011-12-17 01:37:35,568 INFO > org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: > Initializing NameNodeMeterics using context > object:org.apache.hadoop.metrics.spi.NullContext > 2011-12-17 01:37:35,612 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop > 2011-12-17 01:37:35,613 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup > 2011-12-17 01:37:35,613 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > isPermissionEnabled=true > 2011-12-17 01:37:35,620 INFO > org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: > Initializing FSNamesystemMetrics using context > object:org.apache.hadoop.metrics.spi.NullContext > 2011-12-17 01:37:35,621 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered > FSNamesystemStatusMBean > 2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: > Number of files = 16978046 > 2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: > Number of files under construction = 1 > 2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: > Image file of size 2589456651 loaded in 348 seconds. > 2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: > Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size > 3885 edits # 43 loaded in 0 seconds. > > > What's coming up in the startup sequence? We have a ton of data on there. > Is there any way to estimate startup time? >
-
Re: desperate question about NameNode startup sequence
Meng Mao 2011-12-17, 08:01
Maybe this is a bad sign -- the edits.new was created before the master node crashed, and is huge:
-bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current total 41G -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION
could this mean something was up with our SecondaryNameNode and rolling the edits file?
On Sat, Dec 17, 2011 at 2:53 AM, Meng Mao <[EMAIL PROTECTED]> wrote:
> All of the worker nodes datanodes' logs haven't logged anything after the > initial startup announcement: > STARTUP_MSG: host = prod1-worker075/10.2.19.75 > STARTUP_MSG: args = [] > STARTUP_MSG: version = 0.20.1+169.56 > STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; > compiled by 'root' on Tue Feb 9 13:40:08 EST 2010 > ************************************************************/ > > On Sat, Dec 17, 2011 at 2:00 AM, Meng Mao <[EMAIL PROTECTED]> wrote: > >> Our CDH2 production grid just crashed with some sort of master node >> failure. >> When I went in there, JobTracker was missing and NameNode was up. >> Trying to ls on HDFS met with no connection. >> >> We decided to go for a restart. This is in the namenode log right now: >> >> 2011-12-17 01:37:35,568 INFO >> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: >> Initializing NameNodeMeterics using context >> object:org.apache.hadoop.metrics.spi.NullContext >> 2011-12-17 01:37:35,612 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=hadoop,hadoop >> 2011-12-17 01:37:35,613 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup >> 2011-12-17 01:37:35,613 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: >> isPermissionEnabled=true >> 2011-12-17 01:37:35,620 INFO >> org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: >> Initializing FSNamesystemMetrics using context >> object:org.apache.hadoop.metrics.spi.NullContext >> 2011-12-17 01:37:35,621 INFO >> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered >> FSNamesystemStatusMBean >> 2011-12-17 01:37:35,648 INFO >> org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046 >> 2011-12-17 01:43:24,023 INFO >> org.apache.hadoop.hdfs.server.common.Storage: Number of files under >> construction = 1 >> 2011-12-17 01:43:24,025 INFO >> org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 >> loaded in 348 seconds. >> 2011-12-17 01:43:24,030 INFO >> org.apache.hadoop.hdfs.server.common.Storage: Edits file >> /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # >> 43 loaded in 0 seconds. >> >> >> What's coming up in the startup sequence? We have a ton of data on there. >> Is there any way to estimate startup time? >> > >
-
Re: desperate question about NameNode startup sequence
Meng Mao 2011-12-17, 16:42
The namenode eventually came up. Here's the resumation of the logging:
2011-12-17 01:37:35,648 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 16978046 2011-12-17 01:43:24,023 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files under construction = 1 2011-12-17 01:43:24,025 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 2589456651 loaded in 348 seconds. 2011-12-17 01:43:24,030 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits of size 3885 edits # 43 loaded in 0 seconds. 2011-12-17 03:06:26,731 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Invalid opcode, reached end of edit log Number of transactions found 306757368 2011-12-17 03:06:26,732 INFO org.apache.hadoop.hdfs.server.common.Storage: Edits file /hadoop/hadoop-metadata/cache/dfs/name/current/edits.new of size 41011966085 edits # 306757368 loaded in 4982 seconds. 2011-12-17 03:06:47,264 INFO org.apache.hadoop.hdfs.server.common.Storage: Image file of size 1724849462 saved in 19 seconds. 2011-12-17 03:07:09,051 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading FSImage in 5373458 msecs
It took a long time to load edits.new, consistent with the speed at which it loaded fsimage.
So at 3:12 or so, the namenode came back up. At that time, the fsimage got updated on our secondary namenode: [vmc@prod1-secondary ~]$ ls -l /hadoop/hadoop-metadata/cache/dfs/namesecondary/current/ total 1686124 -rw-r--r-- 1 hadoop hadoop 38117 Dec 17 03:12 edits -rw-r--r-- 1 hadoop hadoop 1724849462 Dec 17 03:12 fsimage
But that's it. No updates since then. Is that normal 2NN behavior? I don't think we've tuned away from the defaults for fsimage and edits maintenance. How should I diagnose?
Similarly, primary namenode seems to continue to log changes to edits.new: -bash-3.2$ ls -l /hadoop/hadoop-metadata/cache/dfs/name/current/ total 1728608 -rw-r--r-- 1 hadoop hadoop 38117 Dec 17 03:12 edits -rw-r--r-- 1 hadoop hadoop 44064633 Dec 17 11:41 edits.new -rw-r--r-- 1 hadoop hadoop 1724849462 Dec 17 03:06 fsimage -rw-r--r-- 1 hadoop hadoop 8 Dec 17 03:07 fstime -rw-r--r-- 1 hadoop hadoop 101 Dec 17 03:07 VERSION
Is this normal? Have I been misunderstanding normal NN operation?
On Sat, Dec 17, 2011 at 3:01 AM, Meng Mao <[EMAIL PROTECTED]> wrote:
> Maybe this is a bad sign -- the edits.new was created before the master > node crashed, and is huge: > > -bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current > total 41G > -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits > -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new > -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage > -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime > -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION > > could this mean something was up with our SecondaryNameNode and rolling > the edits file? > > On Sat, Dec 17, 2011 at 2:53 AM, Meng Mao <[EMAIL PROTECTED]> wrote: > >> All of the worker nodes datanodes' logs haven't logged anything after the >> initial startup announcement: >> STARTUP_MSG: host = prod1-worker075/10.2.19.75 >> STARTUP_MSG: args = [] >> STARTUP_MSG: version = 0.20.1+169.56 >> STARTUP_MSG: build = -r 8e662cb065be1c4bc61c55e6bff161e09c1d36f3; >> compiled by 'root' on Tue Feb 9 13:40:08 EST 2010 >> ************************************************************/ >> >> On Sat, Dec 17, 2011 at 2:00 AM, Meng Mao <[EMAIL PROTECTED]> wrote: >> >>> Our CDH2 production grid just crashed with some sort of master node >>> failure. >>> When I went in there, JobTracker was missing and NameNode was up. >>> Trying to ls on HDFS met with no connection. >>> >>> We decided to go for a restart. This is in the namenode log right now: >>> >>> 2011-12-17 01:37:35,568 INFO >>> org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: >>> Initializing NameNodeMeterics using context >>> object:org.apache.hadoop.metrics.spi.NullContext
-
Re: desperate question about NameNode startup sequence
Brock Noland 2011-12-17, 17:19
Hi, Since your using CDH2, I am moving this to CDH-USER. You can subscribe here: http://groups.google.com/a/cloudera.org/group/cdh-userBCC'd common-user On Sat, Dec 17, 2011 at 2:01 AM, Meng Mao <[EMAIL PROTECTED]> wrote: > Maybe this is a bad sign -- the edits.new was created before the master > node crashed, and is huge: > > -bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current > total 41G > -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits > -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new > -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage > -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime > -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION > > could this mean something was up with our SecondaryNameNode and rolling the > edits file? Yes it looks like a checkpoint never completed. It's a good idea to monitor the mtime on fsimage to ensure it never gets too old. Has a checkpoint completed since you restarted? Brock
-
Re: desperate question about NameNode startup sequence
Edward Capriolo 2011-12-17, 21:00
The problem with checkpoint /2nn is that it happily "runs" and has no outward indication that it is unable to connect. Because you have a large edits file you startup will complete, however with that size it could take hours. It logs nothing while this is going on but as long as the CPU is working that means it is progressing. We have a nagios check on the size of this directory so if the edit rolling stops we know about it. On Saturday, December 17, 2011, Brock Noland <[EMAIL PROTECTED]> wrote: > Hi, > > Since your using CDH2, I am moving this to CDH-USER. You can subscribe here: > > http://groups.google.com/a/cloudera.org/group/cdh-user> > BCC'd common-user > > On Sat, Dec 17, 2011 at 2:01 AM, Meng Mao <[EMAIL PROTECTED]> wrote: >> Maybe this is a bad sign -- the edits.new was created before the master >> node crashed, and is huge: >> >> -bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current >> total 41G >> -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits >> -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new >> -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage >> -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime >> -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION >> >> could this mean something was up with our SecondaryNameNode and rolling the >> edits file? > > Yes it looks like a checkpoint never completed. It's a good idea to > monitor the mtime on fsimage to ensure it never gets too old. > > Has a checkpoint completed since you restarted? > > Brock >
-
Re: desperate question about NameNode startup sequence
Meng Mao 2011-12-17, 21:04
Bruce, thanks for moving this over. I wasn't aware there were new lists for CDH. How should I diagnose if our 2NN is working right now? On Sat, Dec 17, 2011 at 4:00 PM, Edward Capriolo <[EMAIL PROTECTED]>wrote: > The problem with checkpoint /2nn is that it happily "runs" and has no > outward indication that it is unable to connect. > > Because you have a large edits file you startup will complete, however > with that size it could take hours. It logs nothing while this is going on > but as long as the CPU is working that means it is progressing. > > We have a nagios check on the size of this directory so if the edit > rolling stops we know about it. > > > On Saturday, December 17, 2011, Brock Noland <[EMAIL PROTECTED]> wrote: > > Hi, > > > > Since your using CDH2, I am moving this to CDH-USER. You can subscribe > here: > > > > http://groups.google.com/a/cloudera.org/group/cdh-user> > > > BCC'd common-user > > > > On Sat, Dec 17, 2011 at 2:01 AM, Meng Mao <[EMAIL PROTECTED]> wrote: > >> Maybe this is a bad sign -- the edits.new was created before the master > >> node crashed, and is huge: > >> > >> -bash-3.2$ ls -lh /hadoop/hadoop-metadata/cache/dfs/name/current > >> total 41G > >> -rw-r--r-- 1 hadoop hadoop 3.8K Jan 27 2011 edits > >> -rw-r--r-- 1 hadoop hadoop 39G Dec 17 00:44 edits.new > >> -rw-r--r-- 1 hadoop hadoop 2.5G Jan 27 2011 fsimage > >> -rw-r--r-- 1 hadoop hadoop 8 Jan 27 2011 fstime > >> -rw-r--r-- 1 hadoop hadoop 101 Jan 27 2011 VERSION > >> > >> could this mean something was up with our SecondaryNameNode and rolling > the > >> edits file? > > > > Yes it looks like a checkpoint never completed. It's a good idea to > > monitor the mtime on fsimage to ensure it never gets too old. > > > > Has a checkpoint completed since you restarted? > > > > Brock > > >
|
|