HDFS, mail # user - Out of space preventing namenode startup

Patrick Marchwiak 2010-10-06, 20:56
Re: Out of space preventing namenode startup
Allen Wittenauer 2010-10-06, 21:04

Given this is the third time this has come up in the past two days, I guess we need a new FAQ entry or three.

We also clearly need to update the quickstart so that it says:

a) Do not run a datanode on the namenode.
b) Make sure dfs.name.dir has two entries, one on a remote box.
c) The slaves file has nothing to do with which nodes are in HDFS.
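
To make point (b) concrete, here is a minimal sketch of a pre-flight check, assuming dfs.name.dir is given as a comma-separated list of directories and that the "remote" entry is something like an NFS mount visible on the namenode host. The function name check_name_dirs, the 1 GiB default threshold, and the paths in the usage note are illustrative assumptions, not part of Hadoop.

```python
import shutil
from pathlib import Path


def check_name_dirs(dfs_name_dir, min_free_bytes=1 << 30):
    """Return a list of warnings for a comma-separated dfs.name.dir value.

    Hypothetical helper: it only inspects the local filesystem view of
    each directory and does not talk to the namenode.
    """
    # Split the property value the way a comma-separated list is usually read.
    dirs = [d.strip() for d in dfs_name_dir.split(",") if d.strip()]
    warnings = []

    # One directory means one full disk can take out the only copy of the edit log.
    if len(dirs) < 2:
        warnings.append(
            "dfs.name.dir has %d entries; configure at least two" % len(dirs)
        )

    for d in dirs:
        if not Path(d).is_dir():
            warnings.append("%s: not a directory (unmounted?)" % d)
            continue
        # shutil.disk_usage reports total/used/free bytes for the filesystem holding d.
        free = shutil.disk_usage(d).free
        if free < min_free_bytes:
            warnings.append("%s: only %d bytes free" % (d, free))

    return warnings
```

Something like check_name_dirs("/data/1/dfs/nn,/mnt/nfs/dfs/nn") (both paths made up) could run from cron on the namenode host and mail out any warnings it returns before the disk actually fills.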

On Oct 6, 2010, at 1:56 PM, Patrick Marchwiak wrote:

> While I was copying files to hdfs, the hadoop fs client started to
> report errors. Digging into the datanode logs revealed [1] that I had
> run out of space on one of my datanodes. The namenode (running on the
> same machine as the failed datanode) died with a fatal error [2] when
> this happened and the logs seem to indicate some kind of corruption. I
> am unable to start up my namenode now due to the current state of hdfs
> [3].
> I stumbled upon HDFS-1378 which implies that manual editing of edit
> logs must be done to recover from this. How would one go about doing
> this? Are there any other options? Is this expected to happen when a
> datanode runs out of space during a copy? I'm not against wiping clean
> the data directories of each datanode and reformatting the namenode,
> if necessary.
> One other part of this scenario that I can't explain is why data was
> being written to this node in the first place. This machine was not
> listed in the slaves file yet it was still being treated as a
> datanode. I realize now that the datanode daemon should not have been
> started on this machine but I would imagine that it would be ignored
> by the client if it was not in the configuration.
> I'm running CDH3b2.
> Thanks,
> Patrick
> [1] datanode log when space ran out:
> 2010-10-06 10:30:22,995 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-5413202144274811562_223793 src: / dest:
> /
> 2010-10-06 10:30:23,599 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
> exception:
> java.io.IOException: No space left on device
>        at java.io.FileOutputStream.writeBytes(Native Method)
>        at java.io.FileOutputStream.write(FileOutputStream.java:260)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
>        at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:377)
>        at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:118)
> 2010-10-06 10:30:23,617 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock for block blk_-5413202144274811562_223793
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space
> left on device
> [2] namenode log after space ran out:
> 2010-10-06 10:31:03,675 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync
> edit log. Fatal Error.
> 2010-10-06 10:31:03,675 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> storage directories are inaccessible.
> [3] namenode log error during startup:
> 2010-10-06 10:46:35,889 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> initialization failed.
> java.io.IOException: Incorrect data format. logVersion is -18 but
> writables.length is 0.
>        at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:556)
> ....
