-
Out of space preventing namenode startup
Patrick Marchwiak 2010-10-06, 20:56
While I was copying files to hdfs, the hadoop fs client started to report errors. Digging into the datanode logs revealed [1] that I had run out of space on one of my datanodes. The namenode (running on the same machine as the failed datanode) died with a fatal error [2] when this happened and the logs seem to indicate some kind of corruption. I am unable to start up my namenode now due to the current state of hdfs [3].
I stumbled upon HDFS-1378 which implies that manual editing of edit logs must be done to recover from this. How would one go about doing this? Are there any other options? Is this expected to happen when a datanode runs out of space during a copy? I'm not against wiping clean the data directories of each datanode and reformatting the namenode, if necessary.
One other part of this scenario that I can't explain is why data was being written to this node in the first place. This machine was not listed in the slaves file yet it was still being treated as a datanode. I realize now that the datanode daemon should not have been started on this machine but I would imagine that it would be ignored by the client if it was not in the configuration.
I'm running CDH3b2.
Thanks,
Patrick

[1] datanode log when space ran out:

2010-10-06 10:30:22,995 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block blk_-5413202144274811562_223793 src: /128.115.210.46:34712 dest: /128.115.210.46:50010
2010-10-06 10:30:23,599 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError: exception:
java.io.IOException: No space left on device
    at java.io.FileOutputStream.writeBytes(Native Method)
    at java.io.FileOutputStream.write(FileOutputStream.java:260)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:377)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:118)
2010-10-06 10:30:23,617 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in receiveBlock for block blk_-5413202144274811562_223793
org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space left on device

[2] namenode log after space ran out:

2010-10-06 10:31:03,675 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log. Fatal Error.
2010-10-06 10:31:03,675 FATAL org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All storage directories are inaccessible.

[3] namenode log error during startup:

2010-10-06 10:46:35,889 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed.
java.io.IOException: Incorrect data format. logVersion is -18 but writables.length is 0.
    at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:556)
    ....
-
Re: Out of space preventing namenode startup
Allen Wittenauer 2010-10-06, 21:04
Given this is the third time this has come up in the past two days, I guess we need a new FAQ entry or three.
We also clearly need to update the quickstart so that it says:
a) Do not run a datanode on the namenode.
b) Make sure dfs.name.dir has two entries, one on a remote box.
c) The slaves file has nothing to do with which nodes are in the HDFS.
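
As a rough illustration of point b), here is a minimal Python sketch that checks how much free space is left under each entry of a comma-separated dfs.name.dir value. The function name, the threshold, and the injectable free_bytes callback are assumptions made for this example; nothing here is part of Hadoop itself.

```python
import shutil
from typing import Callable, List


def _free_bytes_on_disk(path: str) -> int:
    # shutil.disk_usage reports total/used/free bytes for the
    # filesystem that contains the given path.
    return shutil.disk_usage(path).free


def low_space_dirs(
    name_dirs: str,
    min_free_bytes: int,
    free_bytes: Callable[[str], int] = _free_bytes_on_disk,
) -> List[str]:
    """Return the dfs.name.dir entries with less than min_free_bytes free.

    name_dirs is the raw comma-separated property value. The free_bytes
    callback is a hypothetical hook so the check can be exercised
    without touching real disks.
    """
    dirs = [d.strip() for d in name_dirs.split(",") if d.strip()]
    return [d for d in dirs if free_bytes(d) < min_free_bytes]
```

Calling low_space_dirs with the configured value and a threshold such as 1 << 30 (1 GiB, an arbitrary choice) returns the directories that are running low. Run periodically, a check like this might give a warning before the namenode loses the ability to sync its edit log.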
-
Re: Out of space preventing namenode startup
Shrijeet Paliwal 2010-10-06, 21:07
> One other part of this scenario that I can't explain is why data was
> being written to this node in the first place. This machine was not
> listed in the slaves file yet it was still being treated as a
> datanode.

It doesn't matter whether it was listed in the slaves file. Was the datanode running on that node?

> I realize now that the datanode daemon should not have been
> started on this machine but I would imagine that it would be ignored
> by the client if it was not in the configuration.

Oh yes, it was running. A datanode is not ignored just because it isn't mentioned in the slaves file.

Dig into the hdfs-user mails sent last week and this week. A couple of similar issues were reported, and they have a solution.
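
To make the slaves-file point concrete, here is a small Python sketch that compares the datanodes actually registered with the namenode against the contents of a slaves file. It assumes you have already collected the registered addresses yourself (for example from the output of `hadoop dfsadmin -report`), and that the slaves file has one host per line with `#` starting a comment. The function names are made up for this example, and the second address in the usage note is hypothetical.

```python
from typing import Iterable, List, Set


def parse_slaves(text: str) -> Set[str]:
    """Collect host names from slaves-file text.

    Assumes one host per line; blank lines and anything after '#'
    are skipped.
    """
    hosts = set()
    for line in text.splitlines():
        host = line.split("#", 1)[0].strip()
        if host:
            hosts.add(host)
    return hosts


def unlisted_datanodes(registered: Iterable[str], slaves_text: str) -> List[str]:
    """Return registered datanode addresses whose host is not in the slaves file.

    Each registered entry may be 'host' or 'host:port'; only the host
    part is compared, as a plain string.
    """
    listed = parse_slaves(slaves_text)
    result = []
    for addr in registered:
        host = addr.rsplit(":", 1)[0] if ":" in addr else addr
        if host not in listed:
            result.append(addr)
    return sorted(result)
```

With registered addresses 128.115.210.46:50010 and 10.0.0.2:50010 and a slaves file that lists only 10.0.0.2, the function returns the first address: a datanode that is live in HDFS even though the start scripts never knew about it. The comparison is purely textual, so a slaves file that uses host names while the report shows IP addresses would need name resolution first.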