Yesterday, I bounced my DFS cluster. We realized that "ulimit -u" was, in extreme cases, preventing the name node from creating threads. This had only started occurring within the last day or so. When I brought the name node back up, it had essentially been rolled back by one week, and I lost all changes that had been made since then.
There are a few other factors to consider.
1. I had 3 locations for dfs.name.dir: one local and two NFS. (I originally thought this was 2 local and one NFS when I set it up.) On 1/24, the day we effectively rolled back to, the second NFS mount started showing as FAILED on dfshealth.jsp. We had seen this before without issue, so I didn't consider it critical.
2. Having discovered the above, I changed dfs.name.dir to 2 local drives and one NFS mount, excluding the mount that had failed, before bringing the name node back up.
Reviewing the name node log from the day with the NFS outage, I see:
2013-01-24 16:33:11,794 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Unable to sync edit log.
java.io.IOException: Input/output error
    at sun.nio.ch.FileChannelImpl.force0(Native Method)
    at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source)
    at java.security.AccessController.doPrivileged(Native Method)
2013-01-24 16:33:11,794 WARN org.apache.hadoop.hdfs.server.common.Storage: Removing storage dir /rdisks/xxxxxxxxxxxxxx
Unfortunately, since I wasn't expecting anything terrible to happen, I didn't look too closely at the file system while the name node was down. When I brought it up, the time stamp on the previous checkpoint directory in the dfs.name.dir was right around the above error message. The current directory basically had an fsimage and an empty edits log with the current time stamps.
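The kind of inspection described above, comparing what each dfs.name.dir location holds while the name node is down, can be sketched as a small script. This is only a minimal sketch and not a Hadoop tool: the function names and the example paths are hypothetical, and it assumes each name directory has a "current" subdirectory containing files such as fsimage and edits, as described in this post. It simply reports file sizes and modification times so that a stale or empty copy stands out before a restart.

```python
import os
import time


def summarize_name_dirs(name_dirs, subdir="current"):
    """Return {name_dir: [(filename, size_bytes, mtime_epoch), ...]}.

    A directory that is missing or has no `subdir` maps to an empty list.
    """
    summary = {}
    for d in name_dirs:
        target = os.path.join(d, subdir)
        entries = []
        if os.path.isdir(target):
            for name in sorted(os.listdir(target)):
                path = os.path.join(target, name)
                if os.path.isfile(path):
                    st = os.stat(path)
                    entries.append((name, st.st_size, st.st_mtime))
        summary[d] = entries
    return summary


def newest_dir(summary, filename="edits"):
    """Return the name dir whose copy of `filename` has the latest mtime, or None."""
    best_dir, best_mtime = None, None
    for d, entries in summary.items():
        for name, _size, mtime in entries:
            if name == filename and (best_mtime is None or mtime > best_mtime):
                best_dir, best_mtime = d, mtime
    return best_dir


if __name__ == "__main__":
    # Hypothetical paths; substitute the actual dfs.name.dir entries.
    dirs = ["/path/to/local/name", "/path/to/nfs/name"]
    report = summarize_name_dirs(dirs)
    for d, entries in report.items():
        print(d)
        for name, size, mtime in entries:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
            print(f"  {name:20s} {size:>12d} bytes  {stamp}")
    print("most recently modified edits:", newest_dir(report))
```

Modification time is only a rough indicator of which copy is most complete, so a report like this is a prompt for closer inspection (and for taking a backup of every directory) rather than a decision on its own.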
So: what happened? Should this failure have led to my data loss? I would have thought the local directory would be fine in this scenario. Did I have any other options for data recovery?
Suresh Srinivas 2013-02-05, 22:58