We have been seeing the following issue quite frequently of late. We have a
10 data node cluster with each data node running with 1 GB of memory. Total
disk space is about 17 TB, of which 12 TB is used.
Each of the datanodes has 4 disks attached to it, which we have defined
in hdfs-site.xml.
The writes on the cluster are pretty heavy. We run HBase, MapReduce jobs
and direct HDFS writes as well.
The errors we frequently get are:
java.io.IOException: All datanodes xxx.xxx.xxx.xxx:50010 are bad. Aborting...
We looked into datanode logs but did not find anything wrong there.
Also, we have an imbalance of disk space on a couple of datanodes where one
of the disks is 100 % full. Could that be an issue?
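For anyone who wants to compare against their own nodes, here is a minimal sketch of how per-disk usage on a datanode could be checked. The directory paths and the 95 % threshold are placeholders for illustration, not our actual layout; substitute the directories listed in hdfs-site.xml.

```python
import shutil

# Hypothetical data directories; replace with the ones from hdfs-site.xml.
DATA_DIRS = ["/data1/dfs", "/data2/dfs", "/data3/dfs", "/data4/dfs"]


def volume_usage(paths):
    """Return a list of (path, percent_used) for each path that exists."""
    report = []
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            continue  # path is missing or unreadable on this host
        if usage.total == 0:
            continue  # avoid dividing by zero on pseudo filesystems
        report.append((path, 100.0 * usage.used / usage.total))
    return report


def nearly_full(paths, threshold=95.0):
    """Return only the volumes at or above the given percent-used threshold."""
    return [(p, pct) for p, pct in volume_usage(paths) if pct >= threshold]


if __name__ == "__main__":
    for path, pct in volume_usage(DATA_DIRS):
        print(f"{path}: {pct:.1f}% used")
    for path, pct in nearly_full(DATA_DIRS):
        print(f"WARNING: {path} is {pct:.1f}% full")
```

Run on each datanode, this would show whether a single volume is full while the others still have room.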
We have tried increasing the syslimit params and the xceivers, but the
issue keeps coming back. Is there anything in the conf that we are missing?
Or is the load just too high?
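In case it helps anyone answering, below is a rough sketch for dumping the datanode settings that seem relevant from hdfs-site.xml. The property names are the ones used by older Hadoop releases and are an assumption for any given version (newer releases rename some of them). The sample XML and its values are made up for illustration and are not our real configuration.

```python
import xml.etree.ElementTree as ET

# Assumed property names from older Hadoop releases; check your version's docs.
PROPS = [
    "dfs.data.dir",
    "dfs.datanode.max.xcievers",
    "dfs.datanode.du.reserved",
    "dfs.datanode.failed.volumes.tolerated",
]

# Made-up sample in the standard Hadoop <configuration>/<property> layout.
SAMPLE = """<configuration>
  <property>
    <name>dfs.data.dir</name>
    <value>/data1/dfs,/data2/dfs,/data3/dfs,/data4/dfs</value>
  </property>
  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>
</configuration>"""


def read_props(xml_text, names):
    """Map each requested property name to its value, or None if unset."""
    root = ET.fromstring(xml_text)
    found = {}
    for prop in root.iter("property"):
        name = prop.findtext("name", default="").strip()
        if name in names:
            found[name] = prop.findtext("value", default="").strip()
    return {n: found.get(n) for n in names}


if __name__ == "__main__":
    for name, value in read_props(SAMPLE, PROPS).items():
        print(f"{name} = {value if value is not None else '(not set)'}")
```

To run it against a real file, read the file contents and pass them in place of SAMPLE. Properties reported as not set fall back to the Hadoop defaults.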
If anybody has faced this issue and successfully resolved it, please help
us out with suggestions.
Please let us know if there is something we need to look into here.