-
Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 14:02
Hi,
We have a 5-node CDH3u4 cluster running. When I try to run teragen/terasort, some of the map tasks are Failed/Killed, and the logs show a similar error on all machines.

2012-05-22 09:43:50,831 INFO org.apache.hadoop.hdfs.DFSClient: Exception in createBlockOutputStream 10.0.25.149:50010 java.net.SocketTimeoutException: 69000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/10.0.25.149:55835 remote=/10.0.25.149:50010]
2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient: Abandoning block blk_7260720956806950576_1825
2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: Excluding datanode 10.0.25.149:50010
2012-05-22 09:46:36,350 WARN org.apache.hadoop.mapred.Task: Parent died. Exiting attempt_201205211504_0007_m_000016_1.

Are these kinds of errors common? At least one map task is failing for the above reason on every machine. We are using 24 mappers for teragen. It took 3 hrs 44 min 17 sec to generate 50 GB of data with 24 mappers, with 17 failed / 8 killed task attempts, and 24 min 10 sec for 5 GB of data with 24 mappers and 9 killed task attempts. The cluster works well for small datasets.
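For a rough sense of scale, the two runtimes quoted above can be converted into an aggregate write rate. This is a minimal Python sketch, not part of the original post; it assumes 1 GB = 1000 MB, ignores HDFS replication, and the function name is made up for illustration.

```python
def throughput_mb_per_s(gigabytes, hours, minutes, seconds):
    """Average rate in MB/s for a run, assuming 1 GB = 1000 MB."""
    elapsed_s = hours * 3600 + minutes * 60 + seconds
    return gigabytes * 1000 / elapsed_s


# Figures taken from the post: 50 GB in 3h 44m 17s, 5 GB in 24m 10s.
runs = {
    "50 GB": throughput_mb_per_s(50, 3, 44, 17),
    "5 GB": throughput_mb_per_s(5, 0, 24, 10),
}

for label, rate in runs.items():
    print(f"{label}: {rate:.2f} MB/s across the whole cluster")
```

Under those assumptions both runs come out at roughly 3.5 to 3.7 MB/s for the cluster as a whole, so the 5 GB run is not meaningfully faster per byte than the 50 GB run.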
-
Re: Map/Reduce Tasks Fails
Harsh J 2012-05-22, 14:13
Sandeep,
Is the same DN 10.0.25.149 reported across all failures? And do you notice any machine patterns when observing the failed tasks (i.e. are they clumped on any one or a few particular TTs repeatedly)?
--
Harsh J
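One way to check whether the same DN shows up across all failures is to tally the "Excluding datanode" lines from the task logs. The following is a minimal Python sketch under the assumption that the log lines look like the ones quoted in the original post; the function name and the sample list are illustrative only, and in practice the lines would be read from the task attempt logs.

```python
import re
from collections import Counter

# Matches DFSClient lines of the form quoted in the thread, e.g.
# "... DFSClient: Excluding datanode 10.0.25.149:50010"
EXCLUDED = re.compile(r"DFSClient: Excluding datanode (\d+\.\d+\.\d+\.\d+):\d+")


def count_excluded_datanodes(log_lines):
    """Count how often each datanode IP appears in 'Excluding datanode' lines."""
    counts = Counter()
    for line in log_lines:
        match = EXCLUDED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample input copied from the log excerpt in the original post.
sample = [
    "2012-05-22 09:44:25,968 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Abandoning block blk_7260720956806950576_1825",
    "2012-05-22 09:44:25,973 INFO org.apache.hadoop.hdfs.DFSClient: "
    "Excluding datanode 10.0.25.149:50010",
]

counts = count_excluded_datanodes(sample)
for ip, n in counts.most_common():
    print(ip, n)
```

If one IP dominates the tally, the problem is likely local to that node; an even spread points at something shared, such as storage or network.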
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 14:17
I see killed maps on almost all machines. I just finished terasort on 5 GB of data with 9 killed map tasks.
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 14:23
Task Trackers

Name (web UI):
  tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 (http://hadoop2.liaisondevqa.local:50060/)
  tracker_hadoop4.liaisondevqa.local:localhost/127.0.0.1:40363 (http://hadoop4.liaisondevqa.local:50060/)
  tracker_hadoop5.liaisondevqa.local:localhost/127.0.0.1:46605 (http://hadoop5.liaisondevqa.local:50060/)
  tracker_hadoop3.liaisondevqa.local:localhost/127.0.0.1:37305 (http://hadoop3.liaisondevqa.local:50060/)

Host (all under .liaisondevqa.local)  hadoop2  hadoop4  hadoop5  hadoop3
# running tasks                             0        0        1        0
Max Map Tasks                               6        6        6        6
Max Reduce Tasks                            2        2        2        2
Task Failures                              22       19       21       18
Directory Failures                          0        0        0        0
Node Health Status                        N/A      N/A      N/A      N/A
Seconds Since Node Last Healthy             0        0        0        0
Total Tasks Since Start                    93       91       83       87
Succeeded Tasks Since Start                60       59       47       55
Total Tasks Last Day                       59       65       69       55
Succeeded Tasks Last Day                   28       33       35       28
Total Tasks Last Hour                      64       36       45       57
Succeeded Tasks Last Hour                  38       33       19       34
Seconds since heartbeat                     0        0        0        0

Highest Failures: tracker_hadoop2.liaisondevqa.local:localhost/127.0.0.1:56225 with 22 failures
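To see whether failures cluster on any one TaskTracker, the Task Failures column can be compared against its mean. This Python sketch reads that column as 22, 19, 21 and 18 for hadoop2, hadoop4, hadoop5 and hadoop3; that reading is an interpretation of the flattened table, though it agrees with the "Highest Failures ... 22" line.

```python
from statistics import mean

# Task Failures per tracker, as read from the table above (an interpretation
# of the flattened JobTracker page, not independently verified).
failures = {"hadoop2": 22, "hadoop4": 19, "hadoop5": 21, "hadoop3": 18}

avg = mean(failures.values())

# Relative deviation of each tracker from the mean failure count.
deviation = {host: (count - avg) / avg for host, count in failures.items()}
max_dev = max(abs(d) for d in deviation.values())

for host, d in deviation.items():
    print(f"{host}: {failures[host]} failures ({d:+.0%} vs. mean)")
print(f"largest deviation from mean: {max_dev:.0%}")
```

With these numbers no tracker is more than about 10% away from the mean, which suggests the failures are spread evenly rather than clumped on one node.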
-
Re: Map/Reduce Tasks Fails
Raj Vishwanathan 2012-05-22, 14:50
-
Re: Map/Reduce Tasks Fails
Raj Vishwanathan 2012-05-22, 14:53
What kind of storage is attached to the data nodes? This kind of error can happen when the CPU is really busy with I/O or interrupts.
Can you run top or dstat on some of the data nodes to see how the system is performing?
Raj
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 15:02
Hi Raj,
We are using shared SAN storage, connected over iSCSI and used by multiple servers.

top output from one of the datanodes:

top - 11:01:04 up 19:53, 1 user, load average: 0.00, 0.00, 0.35
Tasks: 180 total, 1 running, 179 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8061608k total, 5010408k used, 3051200k free, 13152k buffers
Swap: 2097144k total, 272k used, 2096872k free, 4355840k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 1714 mapred   20   0 1582m 129m  11m S  0.7  1.6   5:49.68 java
14331 root     20   0 15012 1364  988 R  0.3  0.0   0:00.02 top
    1 root     20   0 19204 1372 1084 S  0.0  0.0   0:00.82 init
    2 root     20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    4 root     20   0     0    0    0 S  0.0  0.0   0:00.14 ksoftirqd/0
    5 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/0
    7 root     RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/1
    8 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root     20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/1
   10 root     RT   0     0    0    0 S  0.0  0.0   0:00.04 watchdog/1
   11 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   12 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root     20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/2
   14 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/2
   15 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   16 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root     20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/3
   18 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/3
   19 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/4
   20 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/4
   21 root     20   0     0    0    0 S  0.0  0.0   0:00.02 ksoftirqd/4
   22 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/4
   23 root     RT   0     0    0    0 S  0.0  0.0   0:00.01 migration/5
   24 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/5
   25 root     20   0     0    0    0 S  0.0  0.0   0:00.00 ksoftirqd/5
   26 root     RT   0     0    0    0 S  0.0  0.0   0:00.00 watchdog/5
   27 root     20   0     0    0    0 S  0.0  0.0   0:00.00 events/0
   28 root     20   0     0    0    0 S  0.0  0.0   0:04.27 events/1
   29 root     20   0     0    0    0 S  0.0  0.0   0:02.39 events/2
   30 root     20   0     0    0    0 S  0.0  0.0   0:01.46 events/3
   31 root     20   0     0    0    0 S  0.0  0.0   0:00.11 events/4
   32 root     20   0     0    0    0 S  0.0  0.0   0:00.84 events/5
   33 root     20   0     0    0    0 S  0.0  0.0   0:00.00 cpuset
   34 root     20   0     0    0    0 S  0.0  0.0   0:00.00 khelper
   35 root     20   0     0    0    0 S  0.0  0.0   0:00.00 netns
-
Re: Map/Reduce Tasks Fails
Arun C Murthy 2012-05-22, 17:31
Seems like a question better suited for Cloudera lists...
--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 17:35
I got similar errors with Apache Hadoop 1.0.0.
Thanks, Sandeep.
-
Re: Map/Reduce Tasks Fails
Raj Vishwanathan 2012-05-22, 17:49
Sandeep,
How many network interfaces are there? Is the network shared between iSCSI and M/R communications? Is this the top output from when the system is idle or from when you are getting errors? (I am guessing idle!)
Raj
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 17:56
Raj,
- Network card: VMware generic Gigabit network adapter. As long as these VMs are only talking to each other, the communication speed will be close to 1 Gb/s.
- The top output was taken while the systems were idle.
-
Re: Map/Reduce Tasks Fails
Sandeep Reddy P 2012-05-22, 18:14
Raj,
top output from one datanode, taken while I was getting the error from that machine:

top - 14:10:15 up 23:12, 1 user, load average: 13.45, 12.91, 8.31
Tasks: 187 total, 1 running, 186 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 8061608k total, 7927124k used, 134484k free, 19316k buffers
Swap: 2097144k total, 384k used, 2096760k free, 6694656k cached

  PID USER     PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 1622 hdfs     20   0 1619m 157m  11m S  2.0  2.0  33:55.42 java
14712 mapred   20   0  709m 119m  11m S  1.3  1.5   0:10.06 java
 1706 mapred   20   0 1588m 126m  11m S  1.0  1.6  24:51.69 java
14663 mapred   20   0  708m  89m  11m S  1.0  1.1   0:11.23 java
14686 mapred   20   0  714m 106m  11m S  0.7  1.4   0:11.53 java
14762 mapred   20   0  710m  89m  11m S  0.7  1.1   0:10.05 java
14640 mapred   20   0  704m 119m  11m S  0.3  1.5   0:11.36 java

Error message:

12/05/22 14:09:52 INFO mapred.JobClient: Task Id : attempt_201205211504_0009_m_000002_0, Status : FAILED
java.io.IOException: All datanodes 10.0.24.175:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3181)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2100(DFSClient.java:2720)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2892)
attempt_201205211504_0009_m_000002_0: log4j:WARN No appenders could be found for logger (org.apache.hadoop.hdfs.DFSClient).
attempt_201205211504_0009_m_000002_0: log4j:WARN Please initialize the log4j system properly.

But other map tasks are running on the same datanode.
Thanks, Sandeep.
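The difference between the idle and the failing top snapshots is easiest to see in the I/O wait figure. The Python sketch below parses the two "Cpu(s):" lines quoted in this thread and compares their %wa values; the 50% threshold is an arbitrary illustration, not a Hadoop or procps setting, and a high %wa on its own does not prove that the shared SAN is the cause.

```python
import re


def parse_cpu_line(line):
    """Parse a top 'Cpu(s):' summary line into {'us': 0.7, 'sy': 0.4, ...}."""
    fields = re.findall(r"([\d.]+)%(\w+)", line)
    return {name: float(value) for value, name in fields}


# Both lines are copied from the top output posted in the thread.
idle = parse_cpu_line(
    "Cpu(s): 0.1%us, 0.1%sy, 0.0%ni, 99.9%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st"
)
busy = parse_cpu_line(
    "Cpu(s): 0.7%us, 0.4%sy, 0.0%ni, 0.0%id, 98.9%wa, 0.0%hi, 0.1%si, 0.0%st"
)

IOWAIT_THRESHOLD = 50.0  # arbitrary cut-off chosen for this example

for label, cpu in (("idle", idle), ("during errors", busy)):
    flag = "I/O bound" if cpu["wa"] > IOWAIT_THRESHOLD else "ok"
    print(f"{label}: {cpu['wa']}% iowait, {cpu['id']}% idle -> {flag}")
```

In the failing snapshot almost all CPU time is spent waiting on I/O while the Java processes themselves use only a few percent of CPU, which fits Raj's earlier remark that this kind of timeout can appear when the node is busy with I/O.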