Re: could only be replicated to 0 nodes, instead of 1
Suresh Srinivas 2012-09-04, 17:05
- A datanode typically keeps up to 5 blocks' worth of space (at the HDFS
block size) free.
- Disk space is also used by mapreduce jobs to store temporary shuffle
spills. This is what "dfs.datanode.du.reserved" is for: it reserves space
on the datanode for such non-HDFS use. The configuration goes in
hdfs-site.xml. If you have not configured it, the reserved space is 0.
Besides mapreduce, other files might also take up the disk space.
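
To make the two points above concrete, here is a minimal Python sketch of
the space check they imply. It is only an illustration, not Hadoop's actual
code: the function names, the 64 MB block size, and treating the 5-block
margin as a hard threshold are all assumptions made for this example.

```python
# Illustrative model only; names and defaults are assumptions, not Hadoop APIs.
ASSUMED_BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size in bytes (64 MB)
MIN_FREE_BLOCKS = 5                    # the "5 free blocks" margin described above


def usable_space(disk_free_bytes, reserved_bytes=0):
    """Free space left for HDFS after subtracting the reserved amount.

    reserved_bytes plays the role of dfs.datanode.du.reserved
    (0 when the property is not configured).
    """
    return max(disk_free_bytes - reserved_bytes, 0)


def can_accept_block(disk_free_bytes, reserved_bytes=0,
                     block_size=ASSUMED_BLOCK_SIZE,
                     min_free_blocks=MIN_FREE_BLOCKS):
    """True if the usable space covers the free-block margin."""
    needed = min_free_blocks * block_size
    return usable_space(disk_free_bytes, reserved_bytes) >= needed
```

Under these assumptions, 50 GB free with nothing reserved passes the check.
It fails if the reserved space swallows the free space, or if less than 5
blocks' worth (320 MB at the assumed block size) remains.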
When these errors are thrown, please send the namenode web UI information.
It has storage related information in the cluster summary. That will help
On Tue, Sep 4, 2012 at 9:41 AM, Keith Wiley <[EMAIL PROTECTED]> wrote:
> I've been running up against the good old fashioned "replicated to 0
> nodes" gremlin quite a bit recently. My system (a set of processes
> interacting with hadoop, and of course hadoop itself) runs for a while (a
> day or so) and then I get plagued with these errors. This is a very simple
> system, a single node running pseudo-distributed. Obviously, the
> replication factor is implicitly 1 and the datanode is the same machine as
> the namenode. None of the typical culprits seem to explain the situation
> and I'm not sure what to do. I'm also not sure how I'm getting around it
> so far. I fiddle desperately for a few hours and things start running
> again, but that's not really a solution...I've tried stopping and
> restarting hdfs, but that doesn't seem to improve things.
> So, to go through the common suspects one by one, as quoted on
> • No DataNode instances being up and running. Action: look at the servers,
> see if the processes are running.
> I can interact with hdfs through the command line (doing directory
> listings for example). Furthermore, I can see that the relevant java
> processes are all running (NameNode, SecondaryNameNode, DataNode,
> JobTracker, TaskTracker).
> • The DataNode instances cannot talk to the server, through networking or
> Hadoop configuration problems. Action: look at the logs of one of the
> Obviously irrelevant in a single-node scenario. Anyway, like I said, I
> can perform basic hdfs listings, I just can't upload new data.
> • Your DataNode instances have no hard disk space in their configured data
> directories. Action: look at the dfs.data.dir list in the node
> configurations, verify that at least one of the directories exists, and is
> writeable by the user running the Hadoop processes. Then look at the logs.
> There's plenty of space, at least 50GB.
> • Your DataNode instances have run out of space. Look at the disk capacity
> via the Namenode web pages. Delete old files. Compress under-used files.
> Buy more disks for existing servers (if there is room), upgrade the
> existing servers to bigger drives, or add some more servers.
> Nope, 50GB free, I'm only uploading a few KB at a time, maybe a few MB.
> • The reserved space for a DN (as set in dfs.datanode.du.reserved) is
> greater than the remaining free space, so the DN thinks it has no free space.
> I grepped all the files in the conf directory and couldn't find this
> parameter so I don't really know anything about it. At any rate, it seems
> rather esoteric, I doubt it is related to my problem. Any thoughts on this?
> • You may also get this message due to permissions, e.g. if JT cannot
> create jobtracker.info on startup.
> Meh, like I said, the system basically works...and then stops working.
> The only explanation that would really make sense in that context is
> running out of space...which isn't happening. If this were a permission
> error, or a configuration error, or anything weird like that, then the
> whole system would never get up and running in the first place.
> Why would a properly running hadoop system start exhibiting this error
> without running out of disk space? THAT's the real question on the table.
> Any ideas?