I suspected OOM problems. Just before I got your last message I made some config changes. First, I discovered that I was setting a property called mapreduce.child.java.opts in my mapred-site.xml, and apparently this property is deprecated. I edited it to set mapreduce.map.child.java.opts = -Xmx1024m and mapreduce.reduce.child.java.opts = -Xmx1024m. I also edited hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and HADOOP_DATANODE_OPTS has -Xmx1024m.
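For reference, the edited fragments now look roughly like this (property names exactly as I set them; I'm not 100% sure these are the canonical new names, and the heap sizes are just what fit our boxes):

```xml
<!-- mapred-site.xml: per-task child JVM heaps, as I set them -->
<property>
  <name>mapreduce.map.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```

```shell
# hadoop-env.sh: daemon heaps, as I set them
export HADOOP_NAMENODE_OPTS="-Xmx4g $HADOOP_NAMENODE_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx1024m $HADOOP_DATANODE_OPTS"
```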
Not very scientific to change these all at once, but I'm in a hurry.
The job is now running 25% faster than before and the log files are growing more slowly. No errors and just three failed tasks (all of which succeeded on the second try) after 3.5 hours.
Using top I can see that no Java processes are swapping or going mad, but I can see that some have swapped in the past, so maybe the changes have made a difference.
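A quick way to spot-check swap per process (a sketch, Linux-specific; it reads /proc directly, here using the shell's own PID as a stand-in for a java PID):

```shell
# Show resident vs swapped memory for one process (Linux /proc).
# On the cluster, substitute a java PID, e.g. pid=$(pgrep -f TaskTracker | head -1)
pid=$$   # stand-in PID so the snippet runs anywhere
grep -E '^(Name|VmRSS|VmSwap):' "/proc/$pid/status"
```

A non-zero VmSwap on a java process confirms that the kernel has pushed part of its heap out to swap.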
I'll keep checking over the weekend. Should know whether it's going to work by tomorrow a.m. (London time).
Thanks for your help. I'll do my best to report what I did if I resolve this problem. The Internet is full of unresolved threads.
On 4 Jan 2013, at 15:16, Robert Evans <[EMAIL PROTECTED]> wrote:
> This really should be on the user list so I am moving it over there.
> It is probably something about the OS that is killing it. The only thing
> that I know of on stock Linux that would do this is the Out of Memory
> Killer. Do you have swap enabled on these boxes? You should check the
> OOM killer logs, and if that is the case reset the box.
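> A sketch of what to grep for (exact log location and wording vary by distro and kernel; dmesg or /var/log/messages on most boxes):

```shell
# On a live box you would run:  dmesg | grep -iE 'oom-killer|out of memory|killed process'
# Below, a hypothetical example of the kind of line the kernel emits (not from these boxes),
# used to show that the pattern catches it:
sample='java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0'
echo "$sample" | grep -iE 'oom-killer|out of memory|killed process'
```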
> On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]> wrote:
>> Hi Bobby,
>> Thanks for the response.
>> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log"
>> contained the log messages included in our previous message. It seems to
>> show a series of successful map attempts with a few reduce attempts
>> interspersed, then it gets to a point and only shows a series of reduce
>> attempts that appear to be stuck at the same level of progress, before
>> outputting the 143 exit code and the interrupted sleep message at the end.
>> There is nothing in the tasktracker~.out files...
>> The machines did not go down, but the affected TTs did not log anything
>> until I got up in the morning, saw the job had frozen, and ran
>> stop-all.sh. Then the stalled TTs logged the shutdown.
>> The disks are not full (67% usage across 12 disks per worker).
>> It seems that the 143 exit code indicates that an external process has
>> terminated our tasktracker JVM. Is this correct?
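>> (My reading: exit codes above 128 mean death by signal, and 143 - 128 = 15,
>> i.e. SIGTERM, so something sent the JVM a SIGTERM. A quick sanity check:)

```shell
# Decode a >128 exit status into the signal number behind it (143 -> 15, which is SIGTERM).
code=143
sig=$((code - 128))
echo "exit $code = signal $sig ($(kill -l $sig))"
```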
>> If so, what would the likely suspects be that would terminate our
>> tasktrackers? Is it possible this is related to our operating system
>> (Scientific Linux 6) and PAM limits?
>> We had already increased our hard limit on the number of open files for
>> the "hdfs" user (that launches hdfs and mapred daemons) to 32768 in the
>> hope that this would solve the issue. Can you see anything wrong with our
>> security limits?
>> [hdfs@hd-37-03 hdfs]$ ulimit -aH
>> core file size (blocks, -c) 0
>> data seg size (kbytes, -d) unlimited
>> scheduling priority (-e) 0
>> file size (blocks, -f) unlimited
>> pending signals (-i) 191988
>> max locked memory (kbytes, -l) 64
>> max memory size (kbytes, -m) unlimited
>> open files (-n) 32768
>> pipe size (512 bytes, -p) 8
>> POSIX message queues (bytes, -q) 819200
>> real-time priority (-r) 0
>> stack size (kbytes, -s) unlimited
>> cpu time (seconds, -t) unlimited
>> max user processes (-u) unlimited
>> virtual memory (kbytes, -v) unlimited
>> file locks (-x) unlimited
>> Thanks for your help.
>> On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote:
>>> Is there anything in the task tracker's logs? Did the machines go down?