Hadoop >> mail # general >> Lost tasktracker errors

Re: Lost tasktracker errors
Hi Bobby,

I suspected OOM problems. Just before I got your last message I made some config changes: first I discovered that I was setting a property called mapreduce.child.java.opts in my mapred-site.xml, and apparently this property is deprecated. I edited it to set mapreduce.map.child.java.opts = -Xmx1024m and mapreduce.reduce.child.java.opts = -Xmx1024m. I also edited hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and HADOOP_DATANODE_OPTS has -Xmx1024m.
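For reference, the mapred-site.xml fragment described above would look roughly like this (a sketch only; the property names are the ones mentioned in this thread and vary between Hadoop versions, so verify them against your distribution's mapred-default.xml before using):

```shell
# Sketch of the mapred-site.xml properties discussed above. Written to a
# temp file here rather than a live config, since exact names differ by
# Hadoop version.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<property>
  <name>mapreduce.map.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
<property>
  <name>mapreduce.reduce.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
EOF
grep -c '<name>' "$conf"
```

These fragments would go inside the `<configuration>` element of mapred-site.xml; a tasktracker restart is needed for them to take effect.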

Not very scientific to change these all at once but I'm in a hurry.

The job is now running 25% faster than before and the log files are growing more slowly. No errors and just three failed tasks (all of which succeeded on the second try) after 3.5 hours.

Using top I can see that no Java processes are swapping or going mad, but I can see that some have swapped in the past, so maybe the changes have made a difference.

I'll keep checking over the weekend. Should know whether it's going to work by tomorrow a.m. (London time).

Thanks for your help. I'll do my best to report back what I did if I resolve this problem. The Internet is full of unresolved threads.
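For the record, the OOM-killer check Bobby suggests in the reply quoted below can be done with something like this (a sketch; log locations vary by distro, and on Scientific Linux 6 syslog goes to /var/log/messages):

```shell
# Look for kernel OOM-killer activity. The kernel logs a "Killed process"
# line when it terminates a process; "|| true" keeps the script going if
# nothing matches or dmesg is unavailable.
dmesg | grep -i 'killed process' || true
grep -i 'out of memory' /var/log/messages 2>/dev/null | tail -n 5 || true
# Also worth confirming whether swap is enabled at all:
grep SwapTotal /proc/meminfo
```

If the OOM killer did fire, the matching dmesg lines name the victim process and its memory usage at the time, which makes it easy to confirm whether the tasktracker JVM was the target.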


On 4 Jan 2013, at 15:16, Robert Evans <[EMAIL PROTECTED]> wrote:

> This really should be on the user list so I am moving it over there.
> It is probably something about the OS that is killing it.  The only thing
> that I know of on stock Linux that would do this is the Out of Memory
> Killer.  Do you have swap enabled on these boxes?  You should check the
> OOM killer logs, and if that is the case reset the box.
> --Bobby
> On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]>
> wrote:
>> Hi Bobby,
>> Thanks for the response.
>> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log"
>> contained the log messages included in our previous message. It seems to
>> show a series of successful map attempts with a few reduce attempts
>> interspersed, then it gets to a point and only shows a series of reduce
>> attempts that appear to be stuck at the same level of progress, before
>> outputting the 143 exit code and the interrupted sleep message at the end.
>> There is nothing in the tasktracker~.out files...
>> The machines did not go down, but the affected TTs did not log anything
>> until I got up in the morning, saw the job had frozen, and ran
>> stop-all.sh. Then the stalled TTs logged the shutdown.
>> The disks are not full (67% usage across 12 disks per worker).
>> It seems that the 143 exit code indicates that an external process has
>> terminated our tasktracker JVM. Is this correct?
>> If so, what would the likely suspects be that would terminate our
>> tasktrackers? Is it possible this is related to our operating system
>> (Scientific Linux 6) and PAM limits?
>> We had already increased our hard limit on the number of open files for
>> the "hdfs" user (that launches hdfs and mapred daemons) to 32768 in the
>> hope that this would solve the issue. Can you see anything wrong with our
>> security limits:
>> [hdfs@hd-37-03 hdfs]$ ulimit -aH
>> core file size          (blocks, -c) 0
>> data seg size           (kbytes, -d) unlimited
>> scheduling priority             (-e) 0
>> file size               (blocks, -f) unlimited
>> pending signals                 (-i) 191988
>> max locked memory       (kbytes, -l) 64
>> max memory size         (kbytes, -m) unlimited
>> open files                      (-n) 32768
>> pipe size            (512 bytes, -p) 8
>> POSIX message queues     (bytes, -q) 819200
>> real-time priority              (-r) 0
>> stack size              (kbytes, -s) unlimited
>> cpu time               (seconds, -t) unlimited
>> max user processes              (-u) unlimited
>> virtual memory          (kbytes, -v) unlimited
>> file locks                      (-x) unlimited
>> Thanks for your help.
>> Royston
>> On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote:
>>> Is there anything in the task tracker's logs?  Did the machines go down?