Aaron Zimmerman 2012-12-26, 15:14
Vinod Kumar Vavilapalli 2012-12-26, 19:21
Re: INFO org.apache.hadoop.mapred.TaskTracker: attempt_XXXX NaN%
What information can I get you?
I'm not sending periodic updates; I'm just using Pig to generate the
MapReduce jobs, so perhaps something I'm doing is preventing Pig from
reporting?
The map task itself queries HBase, essentially a full table scan. The
table currently has 9 regions, so there are 9 map tasks.
I don't see that line in the task logs, just in the TaskTracker logs
while the tasks are running.
It looks to me like the tasks that report NaN progress eventually
fail and are retried, and the retry succeeds.
http://hastebin.com/xuhugipuve.sm - task fails
http://hastebin.com/vusadavano.vhdl - task succeeds
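For reference, the "failed to report status for 601 seconds" error and the periodic-update advice quoted below can be illustrated with a small self-contained sketch. Everything in it is hypothetical: `ProgressSink` is a stand-in for the `progress()` method of Hadoop's `Reporter`, and `isTimedOut` only mimics the idea of the 10-minute check (600000 ms is the Hadoop 1.x default for `mapred.task.timeout`). It is not the framework's actual code, and a Pig-generated job would not normally call this by hand.

```java
public class ProgressSketch {
    /** Hypothetical stand-in for the progress() call on Hadoop's Reporter. */
    public interface ProgressSink {
        void progress();
    }

    /**
     * Simulates a long scan that signals liveness every reportEvery rows.
     * Returns how many progress signals were sent.
     */
    public static int scanRows(int totalRows, int reportEvery, ProgressSink sink) {
        int reports = 0;
        for (int row = 1; row <= totalRows; row++) {
            // ... per-row work (for example, handling one scanned row) goes here ...
            if (row % reportEvery == 0) {
                sink.progress(); // tell the framework the task is still alive
                reports++;
            }
        }
        return reports;
    }

    /** A task counts as hung when it has been silent for longer than the timeout. */
    public static boolean isTimedOut(long lastReportMs, long nowMs, long timeoutMs) {
        return nowMs - lastReportMs > timeoutMs;
    }

    public static void main(String[] args) {
        int reports = scanRows(10_000, 1_000, () -> { });
        System.out.println("progress signals sent: " + reports);
        // 601 seconds of silence against a 600-second limit
        System.out.println("timed out: " + isTimedOut(0L, 601_000L, 600_000L));
    }
}
```

The point of the sketch is only that a task which keeps signalling stays under the limit, while one that is silent for 601 seconds against a 600-second limit is killed.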
On 12/26/12 1:21 PM, "Vinod Kumar Vavilapalli" <[EMAIL PROTECTED]> wrote:
>The NaN is very suspicious, perhaps a bug. I will need more information.
>But regardless, are you sending periodic updates from your map/reduce
>code? The framework has a 10-minute timeout to catch hung tasks, so
>user code can report progress via the Reporter interface and avoid the
>timeout.
>+Vinod Kumar Vavilapalli
>On Dec 26, 2012, at 7:14 AM, Aaron Zimmerman wrote:
>> I'm new to Hadoop, setting up a new cluster on Hadoop 1.0.3 that
>> only has 2 datanode/tasktrackers. I'll be adding more soon, but I'm
>> worried about something being configured incorrectly. When I run a
>> moderately expensive MapReduce job (via Pig), the job usually fails
>> (though it does succeed 1/8 times or so).
>> ERROR 2997: Unable to recreate exception from backed error: Task
>> attempt_201212171952_0406_m_000020_3 failed to report status for 601
>> seconds. Killing!
>> Any time a job runs on the cluster, both task tracker logs output
>> line after line of
>> INFO org.apache.hadoop.mapred.TaskTracker:
>> attempt_201212171952_0411_m_000000_0 NaN%, with different attempt IDs.
>> Interspersed with these entries are lines like
>> 0.1851852% reduce > copy (5 of 9 at 0.00 MB/s) >
>> Which makes it look to me like some of the tasks are working, but some
>> of the tasks just stall out, and perhaps they eventually time out the
>> entire job.
>> So maybe my job is just too labor-intensive for the cluster, but the task
>> tracker log entry seems odd, like something is wrong. Why would it say
>> NaN%? I know that I can extend the timeout allotment, but I'd rather
>> not do that as a permanent solution. Is there any other config that I
>> could update? Has anyone seen that task tracker line before? I can't find
>> anything about it via Google, etc.
>> Aaron Zimmerman
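On the "Why would it say NaN%?" question: the thread does not establish where the value comes from, so the following is only a general illustration, not a diagnosis of the TaskTracker. In Java, dividing a floating-point zero by zero yields NaN, so any progress figure computed as done/total prints as NaN when the total is still zero or unknown. The class and method names below are hypothetical.

```java
public class NanProgress {
    /** Naive ratio: 0f / 0f is NaN under IEEE 754 floating-point rules. */
    public static float naiveProgress(long done, long total) {
        return (float) done / (float) total;
    }

    /** Guarded variant: an unknown or zero total counts as no progress; result is clamped to [0, 1]. */
    public static float safeProgress(long done, long total) {
        if (total <= 0) {
            return 0.0f;
        }
        float p = (float) done / (float) total;
        return Math.min(1.0f, Math.max(0.0f, p));
    }

    public static void main(String[] args) {
        System.out.println(naiveProgress(0, 0) * 100 + "%"); // prints NaN%
        System.out.println(safeProgress(0, 0) * 100 + "%");  // prints 0.0%
    }
}
```

If something like this is what happens, it would fit a task that has not yet determined how much work it has, but that remains an assumption until the logs are examined.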