We recently had a couple of incidents that left one or more Hadoop
nodes unresponsive. One was caused by a bug in a utility we use
(ffmpeg) and was resolved by compiling a new version.
The next, today, occurred after attempting to join a new node to the
cluster.
A basic start of the (local) tasktracker and datanode did not work, so
based on a reference I found, I issued hadoop mradmin -refreshNodes,
which was to be followed by hadoop dfsadmin -refreshNodes. The load
average jumped to 60 and the master (which also runs as a slave) became
unresponsive.
It seems to me that this should never happen. Looking around, though, I
found an article from Spotify that mentioned the need to set certain
resource limits on the JVM as well as on the system itself (limits.conf;
we run RHEL). We are fairly new to Hadoop, so some of these issues are
unfamiliar to us.
I wonder if some of the experts here might be able to comment on this
issue, and perhaps point out settings and other measures we can take to
prevent this sort of incident in the future.
Our setup is not complicated. We have 3 Hadoop nodes; the first acts as
both master and slave (it also has more resources). Our jobs split the
work into tasks that are handed to ffmpeg (which is another issue, as it
tends to eat resources, but so far, since the recompile, we are good).
We have two more hardware nodes to add shortly.