|
|
-
Lost tasktracker errors
Royston Sellman 2013-01-04, 11:52
I'm running a job over a 380 billion row 20 TB dataset which is computing sum(), max() etc. The job is running fine at around 3 million rows per second for several hours then grinding to a halt as it loses one after another of the tasktrackers. We see a healthy mix of successful map and reduce attempts on the tasktracker...
2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041105_0 1.0%
2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041105_0 is done.
2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041105_0 was 111
2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 8
2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 0.9884119%
2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of tasks it ran: 1
2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 0.9884134%
2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0%
2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041101_0 1.0%
2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041101_0 is done.
2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041101_0 was 111
2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 9
2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041109_0 1.0%
2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041109_0 is done.
2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041109_0 was 111
2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 10
2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks it ran: 1
2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0%
2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041106_0 1.0%
2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041106_0 is done.
2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041106_0 was 111
2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 11
2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of tasks it ran: 1
2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of tasks it ran: 1
2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0%
2013-01-03 23:41:43,866 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_m_041103_0 1.0%
2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: Task attempt_201301031813_0001_m_041103_0 is done.
2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: reported output size for attempt_201301031813_0001_m_041103_0 was 111
2013-01-03 23:41:43,867 INFO org.apache.hadoop.mapred.TaskTracker: addFreeSlot : current free slots : 12
2013-01-03 23:41:44,021 INFO org.apache.hadoop.mapred.JvmManager: JVM : jvm_201301031813_0001_m_-505301168 exited with exit code 0. Number of tasks it ran: 1
2013-01-03 23:41:45,539 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050235882% reduce > copy (40975 of 271884 at 0.00 MB/s) >
Then it seems to get stuck on reduce attempts, before exiting with a SIG_TERM (143) exit code...
2013-01-03 23:50:49,642 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:50:55,678 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:01,717 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:04,755 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:10,796 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:16,831 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:19,870 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:25,911 INFO org.apache.hadoop.mapred.TaskTracker: attempt_201301031813_0001_r_000000_0 0.050278794% reduce > copy (41010 of 271884 at 0.00 MB/s) >
2013-01-03 23:51:31,953
+
Royston Sellman 2013-01-04, 11:52
-
Re: Lost tasktracker errors
Robert Evans 2013-01-04, 14:34
Is there anything in the task tracker's logs? Did the machines go down? Are there full disks on those nodes?
--Bobby
On 1/4/13 5:52 AM, "Royston Sellman" <[EMAIL PROTECTED]> wrote:
>I'm running a job over a 380 billion row 20 TB dataset which is computing >sum(), max() etc. The job is running fine at around 3 million rows per >second for several hours then grinding to a halt as it loses one after >another of the tasktrackers. We see a healthy mix of successful map and >reduce attempts on the tasktracker... > > > >2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041109_0 1.0% > >2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041105_0 1.0% > >2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041105_0 1.0% > >2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task >attempt_201301031813_0001_m_041105_0 is done. > >2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >reported >output size for attempt_201301031813_0001_m_041105_0 was 111 > >2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >addFreeSlot : current free slots : 8 > >2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041106_0 0.9884119% > >2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : >jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of >tasks >it ran: 1 > >2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041103_0 0.9884134% > >2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041101_0 1.0% > >2013-01-03 23:41:43,193 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041101_0 1.0% > >2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: Task >attempt_201301031813_0001_m_041101_0 is done. > >2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: >reported >output size for attempt_201301031813_0001_m_041101_0 was 111 > >2013-01-03 23:41:43,194 INFO org.apache.hadoop.mapred.TaskTracker: >addFreeSlot : current free slots : 9 > >2013-01-03 23:41:43,303 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041109_0 1.0% > >2013-01-03 23:41:43,306 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041109_0 1.0% > >2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: Task >attempt_201301031813_0001_m_041109_0 is done. > >2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: >reported >output size for attempt_201301031813_0001_m_041109_0 was 111 > >2013-01-03 23:41:43,307 INFO org.apache.hadoop.mapred.TaskTracker: >addFreeSlot : current free slots : 10 > >2013-01-03 23:41:43,361 INFO org.apache.hadoop.mapred.JvmManager: JVM : >jvm_201301031813_0001_m_36690963 exited with exit code 0. Number of tasks >it >ran: 1 > >2013-01-03 23:41:43,428 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041106_0 1.0% > >2013-01-03 23:41:43,432 INFO org.apache.hadoop.mapred.TaskTracker: >attempt_201301031813_0001_m_041106_0 1.0% > >2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: Task >attempt_201301031813_0001_m_041106_0 is done. > >2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: >reported >output size for attempt_201301031813_0001_m_041106_0 was 111 > >2013-01-03 23:41:43,433 INFO org.apache.hadoop.mapred.TaskTracker: >addFreeSlot : current free slots : 11 > >2013-01-03 23:41:43,457 INFO org.apache.hadoop.mapred.JvmManager: JVM : >jvm_201301031813_0001_m_-2095784622 exited with exit code 0. Number of >tasks >it ran: 1 > >2013-01-03 23:41:43,595 INFO org.apache.hadoop.mapred.JvmManager: JVM : >jvm_201301031813_0001_m_1190449426 exited with exit code 0. Number of >tasks >it ran: 1 > >2013-01-03 23:41:43,862 INFO org.apache.hadoop.mapred.TaskTracker:
+
Robert Evans 2013-01-04, 14:34
-
Re: Lost tasktracker errors
Royston Sellman 2013-01-04, 15:02
Hi Bobby,
Thanks for the response Bobby,
The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log" contained the log messages included in our previous message. It seems to show a series of successful map attempts with a few reduce attempts interspersed, then it gets to a point and only shows a series of reduce attempts that appear to be stuck at the same level of progress, before outputting the 143 exit code and the interrupted sleep message at the end.
There is nothing in the tasktracker~.out files...
The machines did not go down but the affected TTs did not log anything till I got up in the morning, saw the job had frozen, did stop-all.sh. Then the stalled TTs logged the shutdown.
The disks are not full (67% usage across 12 disks per worker).
It seems that the 143 exit code indicates that an external process has terminated our tasktracker JVM. Is this correct?
If so, what would the likely suspects be that would terminate our tasktrackers? Is it possible this is related to our operating system (Scientific Linux 6) and PAM limits?
We had already increased our hard limit on the number of open files for the "hdfs" user (that launches hdfs and mapred daemons) to 32768 in the hope that this would solve the issue. Can you see anything wrong with our security limits:
[hdfs@hd-37-03 hdfs]$ ulimit -aH core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 191988 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 32768 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) unlimited cpu time (seconds, -t) unlimited max user processes (-u) unlimited virtual memory (kbytes, -v) unlimited file locks (-x) unlimited
Thanks for your help.
Royston
On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote:
> Is there anything in the task tracker's logs? Did the machines go down? > Are there full disks on those nodes? > > --Bobby > > On 1/4/13 5:52 AM, "Royston Sellman" <[EMAIL PROTECTED]> > wrote: > >> I'm running a job over a 380 billion row 20 TB dataset which is computing >> sum(), max() etc. The job is running fine at around 3 million rows per >> second for several hours then grinding to a halt as it loses one after >> another of the tasktrackers. We see a healthy mix of successful map and >> reduce attempts on the tasktracker... >> >> >> >> 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041109_0 1.0% >> >> 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041105_0 1.0% >> >> 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041105_0 1.0% >> >> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task >> attempt_201301031813_0001_m_041105_0 is done. >> >> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >> reported >> output size for attempt_201301031813_0001_m_041105_0 was 111 >> >> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >> addFreeSlot : current free slots : 8 >> >> 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041106_0 0.9884119% >> >> 2013-01-03 23:41:40,432 INFO org.apache.hadoop.mapred.JvmManager: JVM : >> jvm_201301031813_0001_m_2021872807 exited with exit code 0. Number of >> tasks >> it ran: 1 >> >> 2013-01-03 23:41:40,807 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041103_0 0.9884134% >> >> 2013-01-03 23:41:43,190 INFO org.apache.hadoop.mapred.TaskTracker: >> attempt_201301031813_0001_m_041101_0 1.0% >> >
+
Royston Sellman 2013-01-04, 15:02
-
Re: Lost tasktracker errors
Robert Evans 2013-01-04, 15:16
This really should be on the user list so I am moving it over there.
It is probably something about the OS that is killing it. The only thing that I know of on stock Linux that would do this is the Out of Memory Killer. Do you have swap enabled on these boxes? You should check the OOM killer logs, and if that is the case reset the box.
--Bobby
On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]> wrote:
>Hi Bobby, > >Thanks for the response Bobby, > >The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log" >contained the log messages included in our previous message. It seems to >show a series of successful map attempts with a few reduce attempts >interspersed, then it gets to a point and only shows a series of reduce >attempts that appear to be stuck at the same level of progress, before >outputting the 143 exit code and the interrupted sleep message at the end. > >There is nothing in the tasktracker~.out files... > >The machines did not go down but the affected TTs did not log anything >till I got up in the morning, saw the job had frozen, did stop-all.sh. >Then the stalled TTs logged the shutdown. > >The disks are not full (67% usage across 12 disks per worker). > >It seems that the 143 exit code indicates that an external process has >terminated our tasktracker JVM. Is this correct? > >If so, what would the likely suspects be that would terminate our >tasktrackers? Is it possible this is related to our operating system >(Scientific Linux 6) and PAM limits? > >We had already increased our hard limit on the number of open files for >the "hdfs" user (that launches hdfs and mapred daemons) to 32768 in the >hope that this would solve the issue. Can you see anything wrong with our >security limits: > >[hdfs@hd-37-03 hdfs]$ ulimit -aH >core file size (blocks, -c) 0 >data seg size (kbytes, -d) unlimited >scheduling priority (-e) 0 >file size (blocks, -f) unlimited >pending signals (-i) 191988 >max locked memory (kbytes, -l) 64 >max memory size (kbytes, -m) unlimited >open files (-n) 32768 >pipe size (512 bytes, -p) 8 >POSIX message queues (bytes, -q) 819200 >real-time priority (-r) 0 >stack size (kbytes, -s) unlimited >cpu time (seconds, -t) unlimited >max user processes (-u) unlimited >virtual memory (kbytes, -v) unlimited >file locks (-x) unlimited > >Thanks for your help. > >Royston > >On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote: > >> Is there anything in the task tracker's logs? Did the machines go down? >> Are there full disks on those nodes? >> >> --Bobby >> >> On 1/4/13 5:52 AM, "Royston Sellman" <[EMAIL PROTECTED]> >> wrote: >> >>> I'm running a job over a 380 billion row 20 TB dataset which is >>>computing >>> sum(), max() etc. The job is running fine at around 3 million rows per >>> second for several hours then grinding to a halt as it loses one after >>> another of the tasktrackers. We see a healthy mix of successful map >>>and >>> reduce attempts on the tasktracker... >>> >>> >>> >>> 2013-01-03 23:41:40,249 INFO org.apache.hadoop.mapred.TaskTracker: >>> attempt_201301031813_0001_m_041109_0 1.0% >>> >>> 2013-01-03 23:41:40,256 INFO org.apache.hadoop.mapred.TaskTracker: >>> attempt_201301031813_0001_m_041105_0 1.0% >>> >>> 2013-01-03 23:41:40,260 INFO org.apache.hadoop.mapred.TaskTracker: >>> attempt_201301031813_0001_m_041105_0 1.0% >>> >>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: Task >>> attempt_201301031813_0001_m_041105_0 is done. >>> >>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >>> reported >>> output size for attempt_201301031813_0001_m_041105_0 was 111 >>> >>> 2013-01-03 23:41:40,261 INFO org.apache.hadoop.mapred.TaskTracker: >>> addFreeSlot : current free slots : 8 >>> >>> 2013-01-03 23:41:40,374 INFO org.apache.hadoop.mapred.TaskTracker:
+
Robert Evans 2013-01-04, 15:16
-
Re: Lost tasktracker errors
Royston Sellman 2013-01-04, 18:04
Hi Bobby,
I suspected OOM problems. Just before I got your last message I made some config changes: First I discovered that I was setting a property called mapreduce.child.java.opts in my mapred-site.xml and apparently this property is deprecated. I edited it to set mapreduce.map.child.java.opts =3D -Xmx1024m and mapreduce.reduce.child.java.opts =3D -Xmx1024m. I also edited hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and HADOOP_DATANODE_OPTS has -Xmx1024m.=20
Not very scientific to change these all at once but I'm in a hurry.
The job is now running 25% faster than before and log files are growing more slowly. No errors and just three failed tasks (all of which worked second try) after 3.5 hours.
Using top I can see that no java processes are swapping or going mad but I can see that some have swapped in the past so maybe the changes have made a difference
I'll keep checking over the weekend. Should know whether it's going to work by tomorrow a.m. (London time).
Thanks for your help. I'll do my best to report what I did if I resolve this problem. The Internet is full of unresolveds.
Royston
On 4 Jan 2013, at 15:16, Robert Evans <[EMAIL PROTECTED]> wrote:
> This really should be on the user list so I am moving it over there. > > It is probably something about the OS that is killing it. The only thing > that I know of on stock Linux that would do this is the Out of Memory > Killer. Do you have swap enabled on these boxes? You should check the > OOM killer logs, and if that is the case reset the box. > > --Bobby > > On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]> > wrote: > >> Hi Bobby, >> >> Thanks for the response Bobby, >> >> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log" >> contained the log messages included in our previous message. It seems to >> show a series of successful map attempts with a few reduce attempts >> interspersed, then it gets to a point and only shows a series of reduce >> attempts that appear to be stuck at the same level of progress, before >> outputting the 143 exit code and the interrupted sleep message at the end. >> >> There is nothing in the tasktracker~.out files... >> >> The machines did not go down but the affected TTs did not log anything >> till I got up in the morning, saw the job had frozen, did stop-all.sh. >> Then the stalled TTs logged the shutdown. >> >> The disks are not full (67% usage across 12 disks per worker). >> >> It seems that the 143 exit code indicates that an external process has >> terminated our tasktracker JVM. Is this correct? >> >> If so, what would the likely suspects be that would terminate our >> tasktrackers? Is it possible this is related to our operating system >> (Scientific Linux 6) and PAM limits? >> >> We had already increased our hard limit on the number of open files for >> the "hdfs" user (that launches hdfs and mapred daemons) to 32768 in the >> hope that this would solve the issue. Can you see anything wrong with our >> security limits: >> >> [hdfs@hd-37-03 hdfs]$ ulimit -aH >> core file size (blocks, -c) 0 >> data seg size (kbytes, -d) unlimited >> scheduling priority (-e) 0 >> file size (blocks, -f) unlimited >> pending signals (-i) 191988 >> max locked memory (kbytes, -l) 64 >> max memory size (kbytes, -m) unlimited >> open files (-n) 32768 >> pipe size (512 bytes, -p) 8 >> POSIX message queues (bytes, -q) 819200 >> real-time priority (-r) 0 >> stack size (kbytes, -s) unlimited >> cpu time (seconds, -t) unlimited >> max user processes (-u) unlimited >> virtual memory (kbytes, -v) unlimited >> file locks (-x) unlimited >> >> Thanks for your help. >> >> Royston >> >> On 4 Jan 2013, at 14:34, Robert Evans <[EMAIL PROTECTED]> wrote: >> >>> Is there anything in the task tracker's logs? Did the machines go down?
+
Royston Sellman 2013-01-04, 18:04
-
Re: Lost tasktracker errors
Jeff Bean 2013-01-07, 21:03
Hi,
I think this issue looks like the jobtracker is running out of RAM, too. "Lost task tracker" is indicative of what I call the "million mapper march". Lots of tasks running per user, regularly, generating job history that's running out of RAM. JT starts to swap and/or gc pause, TT heartbeats get dropped on the floor or delayed, tasks get rescheduled. If you restarted the JT when you made these changes, you might have just masked the issue.
Things to make sure you try, in descending order of importance:
1. Make sure your job doesn't have too many tasks. The 20TB of data spread over a tens to thousand files is going to spawn fewer tasks than the same task spread over millions of files. 2. Make sure your job tracker isn't getting clogged up with history of these big tasks. Lower mapred.jobtracker.completeuserjobs.maximum from its default of 100 down to about 10 or so. 3. Increase the heap allocated to the job tracker.
Jeff
On Fri, Jan 4, 2013 at 10:04 AM, Royston Sellman < [EMAIL PROTECTED]> wrote:
> Hi Bobby, > > I suspected OOM problems. Just before I got your last message I made > some config changes: First I discovered that I was setting a property > called mapreduce.child.java.opts in my mapred-site.xml and apparently > this property is deprecated. I edited it to set > mapreduce.map.child.java.opts =3D -Xmx1024m and > mapreduce.reduce.child.java.opts =3D -Xmx1024m. I also edited > hadoop-env.sh so that HADOOP_NAMENODE_OPTS has -Xmx4g and > HADOOP_DATANODE_OPTS has -Xmx1024m.=20 > > Not very scientific to change these all at once but I'm in a hurry. > > The job is now running 25% faster than before and log files are growing > more slowly. No errors and just three failed tasks (all of which worked > second try) after 3.5 hours. > > Using top I can see that no java processes are swapping or going mad but > I can see that some have swapped in the past so maybe the changes have > made a difference > > I'll keep checking over the weekend. Should know whether it's going to > work by tomorrow a.m. (London time). > > Thanks for your help. I'll do my best to report what I did if I resolve > this problem. The Internet is full of unresolveds. > > Royston > > On 4 Jan 2013, at 15:16, Robert Evans <[EMAIL PROTECTED]> wrote: > > > This really should be on the user list so I am moving it over there. > > > > It is probably something about the OS that is killing it. The only thing > > that I know of on stock Linux that would do this is the Out of Memory > > Killer. Do you have swap enabled on these boxes? You should check the > > OOM killer logs, and if that is the case reset the box. > > > > --Bobby > > > > On 1/4/13 9:02 AM, "Royston Sellman" <[EMAIL PROTECTED]> > > wrote: > > > >> Hi Bobby, > >> > >> Thanks for the response Bobby, > >> > >> The tasktracker logs such as "hadoop-hdfs-tasktracker-hd-37-03.log" > >> contained the log messages included in our previous message. It seems to > >> show a series of successful map attempts with a few reduce attempts > >> interspersed, then it gets to a point and only shows a series of reduce > >> attempts that appear to be stuck at the same level of progress, before > >> outputting the 143 exit code and the interrupted sleep message at the > end. > >> > >> There is nothing in the tasktracker~.out files... > >> > >> The machines did not go down but the affected TTs did not log anything > >> till I got up in the morning, saw the job had frozen, did stop-all.sh. > >> Then the stalled TTs logged the shutdown. > >> > >> The disks are not full (67% usage across 12 disks per worker). > >> > >> It seems that the 143 exit code indicates that an external process has > >> terminated our tasktracker JVM. Is this correct? > >> > >> If so, what would the likely suspects be that would terminate our > >> tasktrackers? Is it possible this is related to our operating system > >> (Scientific Linux 6) and PAM limits? > >> > >> We had already increased our hard limit on the number of open files for
+
Jeff Bean 2013-01-07, 21:03
|
|