Re: Lost tasktracker errors
Hi Bobby,

Thanks for the response.

The tasktracker logs, such as "hadoop-hdfs-tasktracker-hd-37-03.log", contain the log messages included in our previous message. They show a series of successful map attempts with a few reduce attempts interspersed. From a certain point on they show only reduce attempts that appear to be stuck at the same level of progress, and they end with the 143 exit code and the interrupted sleep message.

There is nothing in the tasktracker~.out files...

The machines did not go down, but the affected TTs logged nothing until I got up in the morning, saw that the job had frozen, and ran stop-all.sh. Only then did the stalled TTs log the shutdown.

The disks are not full (67% usage across 12 disks per worker).

It seems that the 143 exit code indicates that an external process has terminated our tasktracker JVM. Is this correct?
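
For reference, this is how we are reading the exit code. It is a minimal sketch that assumes the usual shell convention where a status above 128 means "128 plus the signal number"; the helper name describe_exit_code is ours and is not part of Hadoop.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit status using the shell convention 128 + signal number."""
    if code > 128:
        signum = code - 128
        try:
            # signal.Signals maps a signal number to its symbolic name.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"unknown signal {signum}"
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(143))
```

Under that convention 143 is 128 + 15, and signal 15 is SIGTERM, which is what a plain "kill <pid>" sends. The code alone does not say which process sent the signal.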

If so, what would the likely suspects be that would terminate our tasktrackers? Is it possible this is related to our operating system (Scientific Linux 6) and PAM limits?

We had already increased the hard limit on the number of open files to 32768 for the "hdfs" user (which launches the HDFS and MapReduce daemons) in the hope that this would solve the issue. Can you see anything wrong with our security limits?

[hdfs@hd-37-03 hdfs]$ ulimit -aH
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 191988
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 32768
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
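
Note that the listing above shows hard limits only (-H), while a process is actually constrained by its soft limits. A small sketch for printing both values follows. It assumes it is run as the same user and from the same kind of login session that starts the daemons; the function names are ours.

```python
import resource


def current_limits(names=("RLIMIT_NOFILE", "RLIMIT_NPROC")):
    """Return {limit name: (soft, hard)} for the calling process."""
    return {name: resource.getrlimit(getattr(resource, name)) for name in names}


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")
```

For a daemon that is already running, the limits in effect can also be read from /proc/<pid>/limits, which avoids guessing how the process was started.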

Thanks for your help.


On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote:

> Is there anything in the task tracker's logs?  Did the machines go down?
> Are there full disks on those nodes?
> --Bobby
> On 1/4/13 5:52 AM, "Royston Sellman" <[EMAIL PROTECTED]> wrote:
>> I'm running a job over a 380 billion row 20 TB dataset which is computing
>> sum(), max() etc. The job is running fine at around 3 million rows per
>> second for several hours then grinding to a halt as it loses one after
>> another of the tasktrackers.  We see a healthy mix of successful map and
>> reduce attempts on the tasktracker...
>> 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
>> 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
>> 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041105_0 is done.
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041105_0  was 111
>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 8
>> 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 0.9884119%
>> 2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of tasks it ran: 1
>> 2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 0.9884134%
>> 2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0%