Aaron Zimmerman 2012-12-26, 15:14
Re: INFO org.apache.hadoop.mapred.TaskTracker: attempt_XXXX NaN%
The NaN is very suspicious, perhaps a bug. We will need more information.
But irrespective of that, are you sending periodic updates from your map/reduce code? The framework has a 10-minute timeout to catch hung tasks, so user code can report progress via the Reporter interface and avoid these task failures.
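As a rough illustration of the progress-reporting idea, here is a minimal sketch in Java. In the Hadoop 1.x `org.apache.hadoop.mapred` API, `map()` and `reduce()` receive a `Reporter`, and calling its `progress()` method tells the framework the task is still alive. To keep the sketch compilable without Hadoop on the classpath, it uses a hypothetical stand-in interface, `ProgressCallback`; the names `ProgressSketch`, `processAll`, `expensiveStep`, and the `every` parameter are likewise assumptions made for this example, not part of Hadoop or Pig.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for org.apache.hadoop.mapred.Reporter; only progress() is modeled. */
interface ProgressCallback {
    void progress();
}

final class ProgressSketch {

    /**
     * Runs the expensive step on each record and pings the reporter after
     * every `every` records, so a long-running loop keeps signalling liveness.
     * Returns the number of progress pings sent.
     */
    static int processAll(List<String> records, ProgressCallback reporter, int every) {
        int pings = 0;
        int seen = 0;
        for (String record : records) {
            expensiveStep(record);
            seen++;
            if (seen % every == 0) {
                reporter.progress(); // in a real task: reporter.progress() on the Hadoop Reporter
                pings++;
            }
        }
        return pings;
    }

    /** Placeholder for whatever slow per-record work the real job does. */
    static int expensiveStep(String record) {
        return record.length();
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c", "d", "e");
        int pings = processAll(records, () -> System.out.println("still alive"), 2);
        System.out.println("progress pings sent: " + pings);
    }
}
```

Pinging every N records is only one possible policy; if a single record can take minutes to process, a time-based check inside the expensive step would be a better fit.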
+Vinod Kumar Vavilapalli
On Dec 26, 2012, at 7:14 AM, Aaron Zimmerman wrote:
> I'm new to hadoop, setting up a new cluster on hadoop 1.0.3 that currently
> only has 2 datanode/tasktrackers. I'll be adding more soon, but I'm worried
> about something being configured incorrectly. When I run a moderately
> expensive map reduce job (via pig), the job usually fails (though it does
> succeed 1/8 times or so).
> ERROR 2997: Unable to recreate exception from backed error: Task
> attempt_201212171952_0406_m_000020_3 failed to report status for 601
> seconds. Killing!
> Any time a job runs on the cluster, both task tracker logs output line after
> line of
> INFO org.apache.hadoop.mapred.TaskTracker:
> attempt_201212171952_0411_m_000000_0 NaN%, with different attempt IDs.
> Interspersed with these entries are lines like,
> org.apache.hadoop.mapred.TaskTracker: attempt_201212171952_0411_r_000000_0
> 0.1851852% reduce > copy (5 of 9 at 0.00 MB/s) >
> Which makes it look to me like some of the tasks are working, but some of
> the tasks just stall out, and perhaps they eventually time out the entire
> job.
> So maybe my job is just too labor intensive for the cluster, but the task
> tracker log entry seems odd, like something is wrong. Why would it say
> NaN%? I know that I can extend the timeout allotment, but I'd rather not do
> that as a permanent solution. Is there any other config that I could
> update? Has anyone seen that task tracker line before? I can't find
> anything about it via Google, etc.
> Aaron Zimmerman
Aaron Zimmerman 2012-12-26, 19:47