Given this is the third time this has come up in the past two days, I guess we need a new FAQ entry or three.
We also clearly need to update the quickstart so that it says:
a) Do not run a datanode on the namenode.
b) Make sure dfs.name.dir has two entries, one on a remote box.
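c) The slaves file has nothing to do with which nodes are in the HDFS. It is only read by the start/stop scripts; any datanode that is started and pointed at the namenode will join the cluster.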
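For (b), here is a rough sketch of the kind of sanity check a quickstart could suggest. It assumes the property lives in hdfs-site.xml and relies on dfs.name.dir accepting a comma-separated list of directories. The function names and the paths in the sample config are made up for illustration, and the script cannot tell whether one of the entries is really on a remote box; it only counts them.

```python
import xml.etree.ElementTree as ET
from typing import List


def name_dirs(hdfs_site_xml: str) -> List[str]:
    """Return the comma-separated entries of dfs.name.dir, or [] if unset."""
    root = ET.fromstring(hdfs_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name", "").strip() == "dfs.name.dir":
            value = prop.findtext("value", "")
            return [d.strip() for d in value.split(",") if d.strip()]
    return []


def has_redundant_name_dirs(hdfs_site_xml: str) -> bool:
    """True if dfs.name.dir lists at least two directories."""
    return len(name_dirs(hdfs_site_xml)) >= 2


# Hypothetical config: one local directory plus one on an NFS mount.
EXAMPLE = """<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/1/dfs/nn,/mnt/remote-nfs/dfs/nn</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(name_dirs(EXAMPLE))
    print(has_redundant_name_dirs(EXAMPLE))
```

With a single entry, a full disk under that one directory is enough to leave the namenode with no usable storage directory, which is the failure in the logs quoted below.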
On Oct 6, 2010, at 1:56 PM, Patrick Marchwiak wrote:
> While I was copying files to hdfs, the hadoop fs client started to
> report errors. Digging into the datanode logs revealed that I had
> run out of space on one of my datanodes. The namenode (running on the
> same machine as the failed datanode) died with a fatal error when
> this happened and the logs seem to indicate some kind of corruption. I
> am unable to start up my namenode now due to the current state of hdfs.
> I stumbled upon HDFS-1378 which implies that manual editing of edit
> logs must be done to recover from this. How would one go about doing
> this? Are there any other options? Is this expected to happen when a
> datanode runs out of space during a copy? I'm not against wiping clean
> the data directories of each datanode and reformatting the namenode,
> if necessary.
> One other part of this scenario that I can't explain is why data was
> being written to this node in the first place. This machine was not
> listed in the slaves file yet it was still being treated as a
> datanode. I realize now that the datanode daemon should not have been
> started on this machine but I would imagine that it would be ignored
> by the client if it was not in the configuration.
> I'm running CDH3b2.
>  datanode log when space ran out:
> 2010-10-06 10:30:22,995 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block
> blk_-5413202144274811562_223793 src: /18.104.22.168:34712 dest:
> 2010-10-06 10:30:23,599 WARN
> org.apache.hadoop.hdfs.server.datanode.DataNode: checkDiskError:
> java.io.IOException: No space left on device
> at java.io.FileOutputStream.writeBytes(Native Method)
> at java.io.FileOutputStream.write(FileOutputStream.java:260)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:453)
> at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:532)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:377)
> at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:118)
> 2010-10-06 10:30:23,617 INFO
> org.apache.hadoop.hdfs.server.datanode.DataNode: Exception in
> receiveBlock for block blk_-5413202144274811562_223793
> org.apache.hadoop.util.DiskChecker$DiskOutOfSpaceException: No space
> left on device
>  namenode log after space ran out:
> 2010-10-06 10:31:03,675 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync
> edit log. Fatal Error.
> 2010-10-06 10:31:03,675 FATAL
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Fatal Error : All
> storage directories are inaccessible.
>  namenode log error during startup:
> 2010-10-06 10:46:35,889 ERROR
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem
> initialization failed.
> java.io.IOException: Incorrect data format. logVersion is -18 but
> writables.length is 0.
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:556)