Just an FYI, found the solution to this problem.
Apparently, it's an OS-level (filesystem) limit on the number of sub-directories that can be created in a single directory. In this case, we had 31998 sub-directories under hadoop/userlogs/, so any new tasks would fail in Job Setup.
From the unix command line, mkdir fails as well:
$ mkdir hadoop/userlogs/testdir
mkdir: cannot create directory `hadoop/userlogs/testdir': Too many links
Difficult to track down because the Hadoop error message gives no hint whatsoever. And normally, you'd look in the userlog itself for more info, but in this case the userlog couldn't be created.
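For anyone who wants to catch this before jobs start failing, here is a minimal sketch of a check that counts the immediate sub-directories of a log directory and reports how much headroom is left. It is only an illustration, not something from Hadoop itself: the function names are made up, and the 31998 limit assumes an ext3-style cap of 32000 links per directory (two of which are used by "." and the parent's entry), which matches the count we saw but may differ on other filesystems.

```python
import os

# Assumption: ext3-style limit of 32000 links per directory. Two links are
# taken by "." and the entry in the parent, leaving 31998 sub-directories
# (the count observed under hadoop/userlogs/ in this thread).
EXT3_MAX_SUBDIRS = 32000 - 2


def count_subdirs(path):
    """Return the number of immediate sub-directories of path."""
    with os.scandir(path) as entries:
        # Symlinks are not followed, since only real directories add links.
        return sum(1 for e in entries if e.is_dir(follow_symlinks=False))


def subdir_headroom(path, limit=EXT3_MAX_SUBDIRS):
    """Return how many more sub-directories fit under path before limit."""
    return limit - count_subdirs(path)
```

Calling subdir_headroom("hadoop/userlogs") from a periodic check and alerting when the result gets close to zero would have flagged this well before mkdir started returning "Too many links".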
From: Marc Limotte
Sent: Wednesday, September 23, 2009 11:06 AM
To: '[EMAIL PROTECTED]'
Subject: Task process exit with nonzero status of 1
I'm seeing this error when I try to run my job.
java.io.IOException: Task process exit with nonzero status of 1.
From what I can find by doing some Google searches, this means the mapred task JVM has crashed. Not many suggestions about what to do about it. Some suggestions about increasing max heap. I tried that, although I don't think that's the issue because it's not a particularly memory intensive process and I've even tried it with a super small input data set of only a few records. Still see the same issue.
Can't find anything else in the logs. I don't think my task even started, because there are no user logs created at all. Seems to fail during Job Setup.
A little more background. This job was working fine for weeks, running hourly, and then failed on Saturday morning and hasn't worked since. Obviously, I looked for something that changed at that point, but no one was working at that time... can't find anything that changed. I tried the job with different input data sets; it doesn't seem to matter, unless I run it with no data at all. The job does run with no input data, but if I have even a few input records it fails, and it doesn't seem to matter which records. I suspected some corruption in HDFS, but I was able to extract the data from HDFS (hadoop dfs -get ...) and the data looks ok. I also copied this data set to our TEST cluster and ran the job there... and it WORKED!
Ran one of our other jobs and it failed as well, so it doesn't seem to be job specific either; looks like every job fails the same way.
Did a complete reboot of the cluster; no impact.
We're using Hadoop 0.20.0 and Java 1.6 update 16 on CentOS 5.2 64-bit.
Any suggestions on what could be wrong or where to look for more information would be appreciated.
Johannes Zillmann 2010-06-14, 17:15