I think this issue looks like the jobtracker is running out of RAM, too.
"Lost task tracker" is indicative of what I call the "million mapper
march": lots of tasks running per user, regularly, generating job history
that fills up the JT's RAM. The JT starts to swap and/or GC pause, TT
heartbeats get dropped on the floor or delayed, and tasks get rescheduled. If you
restarted the JT when you made these changes, you might have just masked
Things to make sure you try, in descending order of importance:
1. Make sure your job doesn't have too many tasks. The 20TB of data spread
over tens to thousands of files is going to spawn fewer tasks than the same
data spread over millions of files.
2. Make sure your job tracker isn't getting clogged up with history of
these big tasks. Lower mapred.jobtracker.completeuserjobs.maximum from its
default of 100 down to about 10 or so.
3. Increase the heap allocated to the job tracker.
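
To put rough numbers on point 1, here is a back-of-the-envelope sketch in
Python. It is only an estimate under assumptions of mine: a 128MB block
size (check what your cluster actually uses), equal-sized files, and an
input format that creates one split per block with at least one split per
file. The function name estimate_map_tasks is made up for this example.

```python
import math

TB = 1024 ** 4
MB = 1024 ** 2


def estimate_map_tasks(total_bytes, num_files, block_bytes=128 * MB):
    """Rough map-task count for equal-sized files.

    Assumes one split per block and at least one split per file.
    The 128MB default is an assumption, not a value from this thread.
    """
    file_bytes = total_bytes / num_files
    splits_per_file = max(1, math.ceil(file_bytes / block_bytes))
    return num_files * splits_per_file


if __name__ == "__main__":
    # Same 20TB of input, spread over different numbers of files.
    for num_files in (1_000, 1_000_000, 10_000_000):
        tasks = estimate_map_tasks(20 * TB, num_files)
        print(f"{num_files:>10} files -> ~{tasks} map tasks")
```

Once the files are smaller than a block, the task count simply tracks the
file count, which is where the job history starts to pile up on the JT.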
On Fri, Jan 4, 2013 at 10:04 AM, Royston Sellman <
[EMAIL PROTECTED]> wrote:
> Hi Bobby,
> I suspected OOM problems. Just before I got your last message I made
> some config changes: First I discovered that I was setting a property
> called mapreduce.child.java.opts in my mapred-site.xml and apparently
> this property is deprecated. I edited it to set
> mapreduce.map.child.java.opts = -Xmx1024m and
> mapreduce.reduce.child.java.opts = -Xmx1024m. I also edited
> hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and
> HADOOP_DATANODE_OPTS has -Xmx1024m.
> Not very scientific to change these all at once but I'm in a hurry.
> The job is now running 25% faster than before and log files are growing
> more slowly. No errors and just three failed tasks (all of which worked
> second try) after 3.5 hours.
> Using top I can see that no java processes are swapping or going mad but
> I can see that some have swapped in the past so maybe the changes have
> made a difference.
> I'll keep checking over the weekend. Should know whether it's going to
> work by tomorrow a.m. (London time).
> Thanks for your help. I'll do my best to report what I did if I resolve
> this problem. The Internet is full of unresolveds.
> On 4 Jan 2013, at 15:16, Robert Evans <[EMAIL PROTECTED]> wrote:
> > This really should be on the user list so I am moving it over there.
> > It is probably something about the OS that is killing it. The only thing
> > that I know of on stock Linux that would do this is the Out of Memory
> > Killer. Do you have swap enabled on these boxes? You should check the
> > OOM killer logs, and if that is the case reset the box.
> > --Bobby
> > On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]>
> > wrote:
> >> Hi Bobby,
> >> Thanks for the response Bobby,
> >> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log"
> >> contained the log messages included in our previous message. It seems to
> >> show a series of successful map attempts with a few reduce attempts
> >> interspersed, then it gets to a point and only shows a series of reduce
> >> attempts that appear to be stuck at the same level of progress, before
> >> outputting the 143 exit code and the interrupted sleep message at the
> >> There is nothing in the tasktracker~.out files...
> >> The machines did not go down but the affected TTs did not log anything
> >> till I got up in the morning, saw the job had frozen, did stop-all.sh.
> >> Then the stalled TTs logged the shutdown.
> >> The disks are not full (67% usage across 12 disks per worker).
> >> It seems that the 143 exit code indicates that an external process has
> >> terminated our tasktracker JVM. Is this correct?
> >> If so, what would the likely suspects be that would terminate our
> >> tasktrackers? Is it possible this is related to our operating system
> >> (Scientific Linux 6) and PAM limits?
> >> We had already increased our hard limit on the number of open files for