-
Unable to start NN after rack assignment attempt
Terry Healy 2012-05-18, 13:51
Running Apache Hadoop 1.0.2 with ~12 datanodes.

Ran fsck / -> OK beforehand; everything was running as expected.

I started trying to use a script to assign nodes to racks, which required several stop-dfs.sh / start-dfs.sh cycles (with some stop-all.sh / start-all.sh too, if that matters).

I got past errors in the script and data file, but dfsadmin -report still showed all nodes assigned to the default rack. I tried replacing one system name in the rack mapping file with its IP address. At this point the NN failed to start up.

So I commented out the topology.script.file.name property statements in hdfs-site.xml.

The NN still fails to start; the trace below indicates an EOFException, but I don't know what file it can't read.

As always, your patience with a noob is appreciated. Any suggestions to get started again? (I can forget about the rack assignment for now.)

Thanks.
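For readers following the rack-assignment part of this post: Hadoop runs the script named by topology.script.file.name with one or more node identifiers as arguments and reads one rack path per argument from its output. The following is a minimal sketch of such a script, not the one used in this thread. The mapping file name `rack_map.txt`, its two-column format, and the function names are assumptions made for illustration. The arguments can arrive as IP addresses rather than hostnames, so listing both forms in the mapping file is a common precaution.

```python
#!/usr/bin/env python
"""Hypothetical rack-mapping script for topology.script.file.name."""
import sys

# Rack returned for any node that is not listed in the mapping file.
DEFAULT_RACK = "/default-rack"


def load_mapping(lines):
    """Parse 'host-or-ip rack' pairs, skipping blanks and # comments."""
    mapping = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        parts = line.split()
        if len(parts) != 2:
            continue  # ignore malformed rows rather than crash
        mapping[parts[0]] = parts[1]
    return mapping


def resolve(names, mapping):
    """Return one rack path per requested name, in the same order."""
    return [mapping.get(name, DEFAULT_RACK) for name in names]


def main(argv, mapping_path="rack_map.txt"):  # assumed file name
    try:
        with open(mapping_path) as fh:
            mapping = load_mapping(fh)
    except IOError:
        mapping = {}  # unreadable file: every node falls back to the default
    for rack in resolve(argv, mapping):
        print(rack)


if __name__ == "__main__":
    main(sys.argv[1:])
```

Running the script by hand with a hostname and then with that node's IP address is a quick way to confirm that both forms resolve to the intended rack before restarting the NameNode.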
-
Re: Unable to start NN after rack assignment attempt
Terry Healy 2012-05-18, 13:57
Sorry, forgot to attach the trace:
<code>
2012-05-18 09:54:45,355 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 128
2012-05-18 09:54:45,379 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)
2012-05-18 09:54:45,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:180)
    at org.apache.hadoop.io.UTF8.readFields(UTF8.java:112)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.readString(FSImage.java:1808)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:901)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:824)
    at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:372)
    at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:100)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:388)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:362)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:276)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:496)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1279)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1288)

2012-05-18 09:54:45,380 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at abcd/1xx.1xx.2xx.3xx
************************************************************/
</code>
-
Re: Unable to start NN after rack assignment attempt
Todd Lipcon 2012-05-18, 16:34
Hi Terry,
It seems like something got truncated in your FSImage... though it's unclear how that might have happened.

If you're able to share your logs and your dfs.name.dir contents, feel free to contact me off-list and I can try to take a look to diagnose the issue and try to recover the system. Of course whenever any corruption issue occurs we take it seriously and want to get at a root cause to prevent future occurrences!

Thanks
-Todd

Todd Lipcon
Software Engineer, Cloudera
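One way to check for the truncation suggested above is to compare the image file in each storage directory. The sketch below assumes the Hadoop 1.x layout, in which each dfs.name.dir and fs.checkpoint.dir holds its image at `current/fsimage`. The function names and the example paths are hypothetical. A checkpoint copy can legitimately be older and differ in size, so a mismatch is a hint rather than proof of corruption.

```python
import hashlib
import os


def describe_image(storage_dir):
    """Return (path, size, md5) for storage_dir/current/fsimage, or None if absent."""
    path = os.path.join(storage_dir, "current", "fsimage")
    if not os.path.isfile(path):
        return None
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in 1 MiB chunks so a large image is not loaded into memory at once.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return path, os.path.getsize(path), digest.hexdigest()


def compare_images(storage_dirs):
    """Print one row per directory so a short or odd copy stands out."""
    rows = [(d, describe_image(d)) for d in storage_dirs]
    for d, info in rows:
        if info is None:
            print("%s: no current/fsimage found" % d)
        else:
            print("%s: %d bytes, md5 %s" % info)
    return rows
```

It could be called as `compare_images(["/data/dfs/name", "/data/dfs/namesecondary"])`, where both paths are placeholders for the directories configured on the cluster in question.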
-
Re: Unable to start NN after rack assignment attempt
Terry Healy 2012-05-18, 18:28
Todd-
Thanks for your reply. I went out on a limb, started digging in the source code, and figured it was the FSImage. So I saved it, copied over the copy from my checkpoint directory, and got running again.

I ran a few jobs to test and went back to getting a problem new node running. Once again it looks like I will have to manually force an exit from safe mode to run fsck -move.

I sent mail to Harsh earlier - I think I must migrate to CDH, as I fear my manual hacking with configs and such has caused the fragile state that the cluster is in now.

Thanks,
Terry

Terry Healy / [EMAIL PROTECTED]
Cyber Security Operations
Brookhaven National Laboratory
Building 515, Upton N.Y. 11973
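The recovery described in this message (save the suspect image, then copy the checkpoint's copy into place) can be sketched as below. This is an illustration under assumptions, not the exact steps taken in the thread: it assumes the Hadoop 1.x `current/fsimage` layout, the function names are hypothetical, and it should only be run with the NameNode stopped. Anything written to HDFS after the checkpoint was taken may not be reflected in the restored image. Hadoop also ships its own route for this, `hadoop namenode -importCheckpoint`, which may be preferable where it applies.

```python
import os
import shutil
import time


def backup_dir(src, backup_root):
    """Copy src (e.g. a name directory's 'current') to a timestamped folder."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    base = os.path.basename(os.path.normpath(src))
    dest = os.path.join(backup_root, base + "-" + stamp)
    shutil.copytree(src, dest)  # fails if dest already exists, by design
    return dest


def restore_fsimage(checkpoint_dir, name_dir, backup_root):
    """Back up name_dir/current, then copy the checkpoint's fsimage over it."""
    current = os.path.join(name_dir, "current")
    saved = backup_dir(current, backup_root)
    shutil.copy2(os.path.join(checkpoint_dir, "current", "fsimage"),
                 os.path.join(current, "fsimage"))
    return saved  # location of the untouched original, for later diagnosis
```

Keeping the untouched original matters here because, as noted earlier in the thread, the truncated image is the main evidence for finding the root cause.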