Björn-Elmar Macek
2012-08-13, 12:31
Mohammad Tariq
2012-08-13, 12:51
Björn-Elmar Macek
2012-08-13, 13:08
Michael Segel
2012-08-13, 12:59
Björn-Elmar Macek
2012-08-13, 13:17
Michael Segel
2012-08-13, 13:36
Mohammad Tariq
2012-08-13, 14:12
Björn-Elmar Macek
2012-08-13, 14:57
James Brown
2012-08-14, 06:51
Sriram Ramachandrasekaran...
2012-08-13, 16:37
Michael Segel
2012-08-13, 20:39
Björn-Elmar Macek
2012-08-16, 13:17
Björn-Elmar Macek
2012-08-20, 10:15
-
DataNode and TaskTracker communication
Björn-Elmar Macek 2012-08-13, 12:31
Hi,

I am currently trying to run my Hadoop program on a cluster. Sadly, my DataNodes and TaskTrackers seem to have difficulties communicating, as their logs show:

* Some DataNodes and TaskTrackers seem to have port problems of some kind, as can be seen in the logs below. I wondered whether this might be related to the localhost entry in /etc/hosts, as a lot of posts about similar errors suggest, but I checked the file: neither localhost nor 127.0.0.1/127.0.1.1 is bound there. (You can still ping localhost, though... the technician of the cluster said he would look into how localhost gets resolved.)
* The other nodes cannot talk to the NameNode and JobTracker (its-cs131). It is not at all clear why this is happening: the "dfs -put" I do directly before the job runs fine, which seems to imply that communication between those servers is working flawlessly.

Is there any reason why this might happen?

Regards,
Elmar

LOGS BELOW:

\____Datanodes

After successfully putting the data into HDFS (at this point I assumed the NameNode and DataNodes had to communicate), I get the following errors when starting the job.

There are two kinds of logs I found. The first one is big (about 12 MB) and looks like this:

############################### LOG TYPE 1 ############################################################
2012-08-13 08:23:27,331 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 0 time(s).
2012-08-13 08:23:28,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 1 time(s).
2012-08-13 08:23:29,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 2 time(s).
2012-08-13 08:23:30,332 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 3 time(s).
2012-08-13 08:23:31,333 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 4 time(s).
2012-08-13 08:23:32,333 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 5 time(s).
2012-08-13 08:23:33,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 6 time(s).
2012-08-13 08:23:34,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 7 time(s).
2012-08-13 08:23:35,334 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 8 time(s).
2012-08-13 08:23:36,335 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 9 time(s).
2012-08-13 08:23:36,335 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.ConnectException: Call to its-cs131/141.51.205.41:35554 failed on connection exception: java.net.ConnectException: Connection refused
    at org.apache.hadoop.ipc.Client.wrapException(Client.java:1095)
    at org.apache.hadoop.ipc.Client.call(Client.java:1071)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:225)
    at $Proxy5.sendHeartbeat(Unknown Source)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.offerService(DataNode.java:904)
    at org.apache.hadoop.hdfs.server.datanode.DataNode.run(DataNode.java:1458)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574)
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
    at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:434)
    at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:560)
    at org.apache.hadoop.ipc.Client$Connection.access$2000(Client.java:184)
    at org.apache.hadoop.ipc.Client.getConnection(Client.java:1202)
    at org.apache.hadoop.ipc.Client.call(Client.java:1046)
    ... 5 more
...
(this continues until the end of the log)

The second kind is short:

########################### LOG TYPE 2 ############################################################
2012-08-13 00:59:19,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = its-cs133.its.uni-kassel.de/141.51.205.43
STARTUP_MSG: args = []
STARTUP_MSG: version = 1.0.2
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0.2 -r 1304954; compiled by 'hortonfo' on Sat Mar 24 23:58:21 UTC 2012
************************************************************/
2012-08-13 00:59:19,203 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-08-13 00:59:19,216 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-08-13 00:59:19,217 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-08-13 00:59:19,218 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2012-08-13 00:59:19,306 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-08-13 00:59:19,346 INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
2012-08-13 00:59:20,482 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: its-cs131/141.51.205.41:35554. Already tried 0 time(s).
2012-08-13 00:59:21,584 INFO org.apache.hadoop.hdfs.server.common.Storage: Storage directory /home/work/bmacek/hadoop/hdfs/slave is not formatted.
2012-08-13 00:59:21,584 IN [...]
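Because the first kind of log is roughly 12 MB of repeated retries, it can help to condense it before reading further. The following is a minimal sketch in Python and is not part of the original thread: it counts retry messages per target, assuming the line format shown in the log above. The function name summarize_retries is made up for illustration.

```python
import re
from collections import Counter

# Matches the retry lines written by org.apache.hadoop.ipc.Client,
# in the format shown in the DataNode log above.
RETRY_RE = re.compile(
    r"Retrying connect to server: (?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)\. "
    r"Already tried (?P<tries>\d+) time\(s\)"
)


def summarize_retries(lines):
    """Count retry messages per (host, ip, port) target."""
    counts = Counter()
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            target = (match.group("host"), match.group("ip"), int(match.group("port")))
            counts[target] += 1
    return counts


# Two lines copied from the log above serve as sample input.
sample = [
    "2012-08-13 08:23:27,331 INFO org.apache.hadoop.ipc.Client: Retrying connect "
    "to server: its-cs131/141.51.205.41:35554. Already tried 0 time(s).",
    "2012-08-13 08:23:28,332 INFO org.apache.hadoop.ipc.Client: Retrying connect "
    "to server: its-cs131/141.51.205.41:35554. Already tried 1 time(s).",
]

for target, count in summarize_retries(sample).items():
    print(target, count)
```

If every retry in the real log points at the same host and port, that suggests a single unreachable service rather than a general network fault; the script only reports what the log already contains.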
-
Re: DataNode and TaskTracker communication
Mohammad Tariq 2012-08-13, 12:51
Hello there,

Could you please share your /etc/hosts file, if you don't mind?

Regards,
Mohammad Tariq
-
Re: DataNode and TaskTracker communication
Björn-Elmar Macek 2012-08-13, 13:08
Sure I can, but it is long, as it is a cluster:

141.51.12.86 hrz-cs400.hrz.uni-kassel.de hrz-cs400
141.51.204.11 hrz-cs401.hrz.uni-kassel.de hrz-cs401
141.51.204.12 hrz-cs402.hrz.uni-kassel.de hrz-cs402
141.51.204.13 hrz-cs403.hrz.uni-kassel.de hrz-cs403
141.51.204.14 hrz-cs404.hrz.uni-kassel.de hrz-cs404
141.51.204.15 hrz-cs405.hrz.uni-kassel.de hrz-cs405
141.51.204.16 hrz-cs406.hrz.uni-kassel.de hrz-cs406
141.51.204.17 hrz-cs407.hrz.uni-kassel.de hrz-cs407
141.51.204.18 hrz-cs408.hrz.uni-kassel.de hrz-cs408
141.51.204.19 hrz-cs409.hrz.uni-kassel.de hrz-cs409
141.51.204.20 hrz-cs410.hrz.uni-kassel.de hrz-cs410
141.51.204.21 hrz-cs411.hrz.uni-kassel.de hrz-cs411
141.51.204.22 hrz-cs412.hrz.uni-kassel.de hrz-cs412
141.51.204.23 hrz-cs413.hrz.uni-kassel.de hrz-cs413
141.51.204.24 hrz-cs414.hrz.uni-kassel.de hrz-cs414
141.51.204.25 hrz-cs415.hrz.uni-kassel.de hrz-cs415
141.51.204.26 hrz-cs416.hrz.uni-kassel.de hrz-cs416
141.51.204.27 hrz-cs417.hrz.uni-kassel.de hrz-cs417
141.51.204.28 hrz-cs418.hrz.uni-kassel.de hrz-cs418
141.51.204.29 hrz-cs419.hrz.uni-kassel.de hrz-cs419
141.51.204.31 hrz-cs421.hrz.uni-kassel.de hrz-cs421
141.51.204.32 hrz-cs422.hrz.uni-kassel.de hrz-cs422
141.51.204.33 hrz-cs423.hrz.uni-kassel.de hrz-cs423
141.51.204.34 hrz-cs424.hrz.uni-kassel.de hrz-cs424
141.51.204.35 hrz-cs425.hrz.uni-kassel.de hrz-cs425
141.51.204.36 hrz-cs426.hrz.uni-kassel.de hrz-cs426
141.51.204.37 hrz-cs427.hrz.uni-kassel.de hrz-cs427
141.51.204.38 hrz-cs428.hrz.uni-kassel.de hrz-cs428
141.51.204.39 hrz-cs429.hrz.uni-kassel.de hrz-cs429
141.51.204.40 hrz-cs430.hrz.uni-kassel.de hrz-cs430
141.51.204.47 hrz-cs437.hrz.uni-kassel.de hrz-cs437
141.51.204.48 hrz-cs438.hrz.uni-kassel.de hrz-cs438
141.51.204.49 hrz-cs439.hrz.uni-kassel.de hrz-cs439
141.51.204.50 hrz-cs440.hrz.uni-kassel.de hrz-cs440
141.51.204.51 hrz-cs441.hrz.uni-kassel.de hrz-cs441
141.51.204.54 hrz-cs444.hrz.uni-kassel.de hrz-cs444
141.51.204.65 hrz-cs455.hrz.uni-kassel.de hrz-cs455
141.51.204.66 hrz-cs456.hrz.uni-kassel.de hrz-cs456
141.51.204.69 hrz-cs459.hrz.uni-kassel.de hrz-cs459
141.51.204.70 hrz-cs460.hrz.uni-kassel.de hrz-cs460
141.51.204.71 hrz-cs461.hrz.uni-kassel.de hrz-cs461
141.51.204.72 hrz-cs462.hrz.uni-kassel.de hrz-cs462
141.51.204.73 hrz-cs463.hrz.uni-kassel.de hrz-cs463
141.51.204.74 hrz-cs464.hrz.uni-kassel.de hrz-cs464
141.51.204.75 hrz-cs465.hrz.uni-kassel.de hrz-cs465
141.51.204.76 hrz-cs466.hrz.uni-kassel.de hrz-cs466
141.51.204.77 hrz-cs467.hrz.uni-kassel.de hrz-cs467
141.51.204.78 hrz-cs468.hrz.uni-kassel.de hrz-cs468
141.51.204.79 hrz-cs469.hrz.uni-kassel.de hrz-cs469
141.51.204.80 hrz-cs470.hrz.uni-kassel.de hrz-cs470
141.51.204.81 hrz-cs471.hrz.uni-kassel.de hrz-cs471
141.51.204.82 hrz-cs472.hrz.uni-kassel.de hrz-cs472
141.51.204.83 hrz-cs473.hrz.uni-kassel.de hrz-cs473
141.51.204.84 hrz-cs474.hrz.uni-kassel.de hrz-cs474
141.51.204.85 hrz-cs475.hrz.uni-kassel.de hrz-cs475
141.51.204.86 hrz-cs476.hrz.uni-kassel.de hrz-cs476
141.51.204.87 hrz-cs477.hrz.uni-kassel.de hrz-cs477
141.51.204.88 hrz-cs478.hrz.uni-kassel.de hrz-cs478
141.51.204.89 hrz-cs479.hrz.uni-kassel.de hrz-cs479
141.51.204.90 hrz-cs480.hrz.uni-kassel.de hrz-cs480
141.51.204.91 hrz-cs481.hrz.uni-kassel.de hrz-cs481
141.51.204.92 hrz-cs482.hrz.uni-kassel.de hrz-cs482
141.51.204.93 hrz-cs483.hrz.uni-kassel.de hrz-cs483
141.51.204.94 hrz-cs484.hrz.uni-kassel.de hrz-cs484
141.51.204.95 hrz-cs485.hrz.uni-kassel.de hrz-cs485
141.51.204.96 hrz-cs486.hrz.uni-kassel.de hrz-cs486
141.51.204.97 hrz-cs487.hrz.uni-kassel.de hrz-cs487
141.51.204.98 hrz-cs488.hrz.uni-kassel.de hrz-cs488
141.51.204.99 hrz-cs489.hrz.uni-kassel.de hrz-cs489
141.51.204.100 hrz-cs490.hrz.uni-kassel.de hrz-cs490
141.51.204.101 hrz-cs491.hrz.uni-kassel.de hrz-cs491
141.51.204.102 hrz-cs492.hrz.uni-kassel.de hrz-cs492
141.51.204.103 hrz-cs493.hrz.uni-kassel.de hrz-cs493
141.51.204.104 hrz-cs494.hrz.uni-kassel.de hrz-cs494
141.51.204.105 hrz-cs495.hrz.uni-kassel.de hrz-cs495
141.51.204.106 hrz-cs496.hrz.uni-kassel.de hrz-cs496
141.51.204.107 hrz-cs497.hrz.uni-kassel.de hrz-cs497
141.51.204.108 hrz-cs498.hrz.uni-kassel.de hrz-cs498
141.51.204.109 hrz-cs499.hrz.uni-kassel.de hrz-cs499
141.51.204.110 hrz-cs500.hrz.uni-kassel.de hrz-cs500
141.51.204.111 hrz-cs501.hrz.uni-kassel.de hrz-cs501
141.51.204.112 hrz-cs502.hrz.uni-kassel.de hrz-cs502
141.51.204.113 hrz-cs503.hrz.uni-kassel.de hrz-cs503
141.51.204.114 hrz-cs504.hrz.uni-kassel.de hrz-cs504
141.51.204.115 hrz-cs505.hrz.uni-kassel.de hrz-cs505
141.51.204.116 hrz-cs506.hrz.uni-kassel.de hrz-cs506
141.51.204.117 hrz-cs507.hrz.uni-kassel.de hrz-cs507
141.51.204.118 hrz-cs508.hrz.uni-kassel.de hrz-cs508
141.51.204.119 hrz-cs509.hrz.uni-kassel.de hrz-cs509
141.51.204.120 hrz-cs510.hrz.uni-kassel.de hrz-cs510
141.51.204.121 hrz-cs511.hrz.uni-kassel.de hrz-cs511
141.51.204.122 hrz-cs512.hrz.uni-kassel.de hrz-cs512
141.51.204.123 hrz-cs513.hrz.uni-kassel.de hrz-cs513
141.51.204.124 hrz-cs514.hrz.uni-kassel.de hrz-cs514
141.51.204.125 hrz-cs515.hrz.uni-kassel.de hrz-cs515
141.51.204.126 hrz-cs516.hrz.uni-kassel.de hrz-cs516
141.51.204.127 hrz-cs517.hrz.uni-kassel.de hrz-cs517
141.51.204.128 hrz-cs518.hrz.uni-kassel.de hrz-cs518
141.51.204.129 hrz-cs519.hrz.uni-kassel.de hrz-cs519
141.51.204.130 hrz-cs520.hrz.uni-kassel.de hrz-cs520
141.51.204.131 hrz-cs521.hrz.uni-kassel.de hrz-cs521
141.51.204.132 hrz-cs522.hrz.uni-kassel.de hrz-cs522
141.51.204.133 hrz-cs523.hrz.uni-kassel.de hrz-cs523
141.51.204.134 hrz-cs524.hrz.uni-kassel.de hrz-cs524
141.51.204.135 hrz-cs525.hrz.uni-kassel.de hrz-cs525
141.51.204.136 hrz-cs526.hrz.uni-kassel.de hrz-cs526
141.51.204.137 hrz-cs527.hrz.uni-kassel.de hrz-cs527
141.51.204.138 hrz-cs528.hrz.uni-kassel.de hrz-cs528
141.51.204.139 hrz-cs529.hrz.uni-kassel.de hrz-cs529
141.51.204.140 hrz-cs530.hrz.uni-kassel.de hrz-cs530
141.51.204.141 hrz-cs531.hrz.uni-kassel.de hrz-cs531
141.51.204.142 hrz-cs532.hrz.uni-kassel.de hrz-cs532
141.51.204.143 hrz-cs533.hrz.uni-kassel.de hrz-cs533
141.51.204.144 hrz-cs534.hrz.u [...]
-
Re: DataNode and TaskTracker communication
Michael Segel 2012-08-13, 12:59
If the nodes can communicate and distribute data, then the odds are that the issue isn't going to be in his /etc/hosts.

A more relevant question is whether he's running a firewall on each of these machines.

A simple test: ssh to one node, then ping other nodes and the control nodes at random to see if they can see one another. Then check whether there is a firewall running that would limit the types of traffic between nodes.

One other side note: are these machines multi-homed?
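Ping only shows that a host answers ICMP; it says nothing about a particular TCP port. As a complement to the test described above, here is a minimal sketch in Python (not from the thread) that makes one TCP connection attempt and classifies the outcome. The function name probe_tcp and the result labels are hypothetical. As a rough guide, "refused" usually means the host is reachable but nothing is listening on that port (or a firewall actively rejects it), while "timeout" often points to packets being silently dropped.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Make one TCP connection attempt and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        # Covers name resolution failures, unreachable networks and similar.
        return f"error: {exc}"


if __name__ == "__main__":
    # Self-check against a throwaway listener on the loopback interface,
    # so the example runs without access to the cluster.
    with socket.socket() as server:
        server.bind(("127.0.0.1", 0))
        server.listen(1)
        print(probe_tcp("127.0.0.1", server.getsockname()[1]))
```

On an affected node, the call of interest would be probe_tcp("its-cs131", 35554), using the host and port from the DataNode log.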
-
Re: DataNode and TaskTracker communication
Björn-Elmar Macek 2012-08-13, 13:17
Hi Michael,

well, I can ssh from any node to any other without being prompted. The reason for this is that my home directory is mounted on every server in the cluster.

Whether the machines are multihomed, I don't know. I could ask if this is of importance.

Shall I?

Regards,
Elmar
-
Re: DataNode and TaskTracker communication
Michael Segel 2012-08-13, 13:36
Based on your /etc/hosts output, why aren't you using DNS?

Outside of MapR, multihomed machines can be problematic. Hadoop doesn't generally work well when you're not using the FQDN or its alias.

The issue isn't SSH. Go to the node that is having trouble connecting to another node and try to ping that node, or attempt some other general communication. If that succeeds, your issue is that the port you're trying to communicate with is blocked. Then it's more than likely an ipconfig or firewall issue.
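To see what a short name such as its-cs131 actually resolves to on a given node, the resolver can be queried directly. The following Python sketch is an illustration, not something posted in the thread: resolution_report is a hypothetical helper built on the standard socket.gethostbyname_ex call, and the stub resolver below stands in for the real one so the example runs anywhere. The stub's return values are assumptions derived from the names and addresses mentioned in the thread.

```python
import socket


def resolution_report(name, resolve=socket.gethostbyname_ex):
    """Resolve a name and summarize canonical name, aliases and addresses."""
    canonical, aliases, addresses = resolve(name)
    return {
        "queried": name,
        "canonical": canonical,
        "aliases": aliases,
        "addresses": addresses,
        # A canonical name without a dot is a short name, not an FQDN.
        "is_fqdn": "." in canonical,
        # Any 127.x.x.x address here would make the service unreachable
        # from other machines.
        "loopback": [a for a in addresses if a.startswith("127.")],
    }


def stub_resolver(name):
    """Stand-in for the real resolver; the returned values are assumed."""
    return ("its-cs131.its.uni-kassel.de", ["its-cs131"], ["141.51.205.41"])


report = resolution_report("its-cs131", resolve=stub_resolver)
for key, value in report.items():
    print(f"{key}: {value}")
```

Calling resolution_report("its-cs131") without the resolve argument on a cluster node would use the system resolver, which consults /etc/hosts and DNS according to the node's configuration.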
-
Re: DataNode and TaskTracker communication
Mohammad Tariq 2012-08-13, 14:12
Hi Michael,

I asked for the hosts file because this looks like a loopback problem to me. The log shows that the call is going to 0.0.0.0. Apart from what you have said, I think disabling IPv6 and making sure that there is no problem with DNS resolution are also necessary. Please correct me if I am wrong. Thank you.

Regards,
Mohammad Tariq
Mohammad Tariq 2012-08-13, 14:12
-
Re: DataNode and Tasttracker communication
Björn-Elmar Macek 2012-08-13, 14:57
Hi,

by "using DNS" you mean using the servers' host names instead of their IP addresses, right? If so, I do use DNS. I am working in a SLURM environment and get a list of nodes for every job I schedule, so I construct the config files for every job by taking the list of assigned nodes and dividing the roles (NameNode, JobTracker, SecondaryNameNode, TaskTrackers, DataNodes) over this set of machines. SLURM gives me names like "its-cs<nodenumber>", which is enough for ssh to connect, but maybe it isn't for all Hadoop processes. The complete names would be "its-cs<nodenumber>.its.uni-kassel.de". I will add this part of the address for testing. But I fear it won't help a lot, because the JobTracker's log seems to know the full names:

###
2012-08-13 01:12:02,770 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208130059_0001_m_000887 has split on node:/default-rack/its-cs202.its.uni-kassel.de
2012-08-13 01:12:02,770 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208130059_0001_m_000888 has split on node:/default-rack/its-cs202.its.uni-kassel.de
2012-08-13 01:12:02,770 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208130059_0001_m_000889 has split on node:/default-rack/its-cs195.its.uni-kassel.de
2012-08-13 01:12:02,770 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208130059_0001_m_000890 has split on node:/default-rack/its-cs196.its.uni-kassel.de
2012-08-13 01:12:02,770 INFO org.apache.hadoop.mapred.JobInProgress: tip:task_201208130059_0001_m_000891 has split on node:/default-rack/its-cs201.its.uni-kassel.de
###

Pings work, by the way: I could ping the NameNode from all problematic nodes. And lsof -i didn't show any other programs running on the NameNode/JobTracker node with the problematic ports. :(

One thing that may be worth noting: the NameNode/JobTracker server is not running anymore at the moment, although the DataNode/TaskTracker logs are still growing.

Concerning IPv6: as far as I can see, I would have to modify global config files to disable it. Since I am only a user of this cluster, with very limited insight into why the machines are configured the way they are, I want to be very careful about asking the technicians to make changes to their setup. I don't want to be disrespectful. I will try using the full names first, and if this doesn't help, I will of course ask them, if no other options are available.
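If the short SLURM names turn out to be the problem, the node list could be expanded to full names before the per-job config files are written. A minimal sketch, assuming the domain its.uni-kassel.de mentioned above; the function names to_fqdn and expand_nodes are made up for illustration and are not part of Hadoop or SLURM:

```sh
#!/bin/sh
# to_fqdn NAME DOMAIN (hypothetical helper)
# Appends DOMAIN to NAME unless NAME already contains a dot.
to_fqdn() {
    case "$1" in
        *.*) printf '%s\n' "$1" ;;
        *)   printf '%s.%s\n' "$1" "$2" ;;
    esac
}

# expand_nodes DOMAIN (hypothetical helper)
# Reads one node name per line on stdin and prints the full names.
expand_nodes() {
    while read -r node; do
        if [ -n "$node" ]; then
            to_fqdn "$node" "$1"
        fi
    done
}
```

Something like `scontrol show hostnames "$SLURM_JOB_NODELIST" | expand_nodes its.uni-kassel.de` should then give a list that can be split into the master and slave roles, assuming the job runs inside a SLURM allocation.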
-
Re: DataNode and Tasttracker communication
James Brown 2012-08-14, 06:51
Hi Bjorn,

For the two items below, it is possible that datanodes and tasktrackers are already running. These commands will show the processes bound to the respective ports:

datanode port:
netstat -putan | grep 50010

tasktracker port:
netstat -putan | grep 50060

If your netstat command does not support the -p option, try lsof.

> \____Datanodes
...
> The second is short kind:
> ########################### LOG TYPE 2 ############################################################
> 2012-08-13 00:59:19,038 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
...
> 2012-08-13 00:59:21,898 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.net.BindException: Problem binding to /0.0.0.0:50010 : Address already in use
...
> \_____TastTracker
...
> ########################### LOG TYPE 2 ############################################################
> 2012-08-13 00:59:24,376 INFO org.apache.hadoop.mapred.TaskTracker: STARTUP_MSG:
...
> 2012-08-13 00:59:38,161 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.net.BindException: Address already in use
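A plain grep for 50010 also matches unrelated lines, for example a remote port or a PID that happens to contain those digits. A small sketch of a stricter filter, assuming the usual Linux netstat column layout (local address in column 4, state in column 6); the function name listeners_on_port is made up for illustration:

```sh
#!/bin/sh
# listeners_on_port PORT (hypothetical helper)
# Reads `netstat -putan`-style lines on stdin and prints only TCP sockets
# that are in LISTEN state and whose local address ends in :PORT.
listeners_on_port() {
    awk -v port="$1" '$1 ~ /^tcp/ && $4 ~ (":" port "$") && $6 == "LISTEN" { print }'
}
```

Usage would be along the lines of `netstat -putan 2>/dev/null | listeners_on_port 50010`. With -p, the last column of a matching line names the PID and program holding the port, which is the process to stop before the datanode can bind again.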
-
Re: DataNode and Tasttracker communication
Sriram Ramachandrasekaran... 2012-08-13, 16:37
The logs indicate an "Address already in use" exception. Is that some sign? :)
-
Re: DataNode and Tasttracker communication
Michael Segel 2012-08-13, 20:39
The key is to think about what can go wrong, but start with the low-hanging fruit.

I mean, you could be right; however, you're jumping the gun and overlooking simpler issues. The most common issue is that the network traffic is being filtered.

Of course, since we're both diagnosing this with minimal information, we're kind of shooting from the hip. This is why I'm asking whether there is any network traffic between the nodes. If you have partial communication, then focus on why you can't see the specific traffic.

On Aug 13, 2012, at 10:05 AM, Mohammad Tariq <[EMAIL PROTECTED]> wrote:
> Thank you so very much for the detailed response Michael. I'll keep the tip in mind. Please pardon my ignorance, as I am still in the learning phase.
>
> Regards,
> Mohammad Tariq
>
> On Mon, Aug 13, 2012 at 8:29 PM, Michael Segel <[EMAIL PROTECTED]> wrote:
> 0.0.0.0 means that the call is going to all interfaces on the machine. (Shouldn't be an issue...)
>
> IPv4 vs IPv6? Could be an issue; however, OP says he can write data to DNs and they seem to communicate, so if it's IPv6 related, wouldn't it impact all traffic and not just a specific port?
> I agree... shut down IPv6 if you can.
>
> I don't disagree with your assessment. I am just suggesting that before you do a really deep dive, you think about the more obvious stuff first.
>
> There are a couple of other things... like do all of the /etc/hosts files on all of the machines match?
> Is the OP using both /etc/hosts and DNS? If so, are they in sync?
>
> BTW, you said DNS in your response. If you're using DNS, then you don't really want to have much info in the /etc/hosts file except loopback and the server's IP address.
>
> Looking at the problem, OP is indicating some traffic works, while other traffic doesn't. Most likely something is blocking the ports. Iptables is the first place to look.
>
> Just saying. ;-)
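The "ping works but one port does not" reasoning from this message can be written down as a small check. This is only a sketch: it assumes Linux ping (-c count, -W timeout in seconds) and a netcat build that supports -z and -w, and the function names diagnose and probe are made up for illustration.

```sh
#!/bin/sh
# diagnose PING_STATUS PORT_STATUS (hypothetical helper)
# Both arguments are exit codes (0 = success) of a ping and of a TCP probe.
diagnose() {
    if [ "$1" -ne 0 ]; then
        echo "host unreachable: check routing and name resolution"
    elif [ "$2" -ne 0 ]; then
        echo "host reachable, port not: check firewall rules and whether the daemon is listening"
    else
        echo "port reachable"
    fi
}

# probe HOST PORT (hypothetical helper)
# Runs both checks against HOST and prints the diagnosis.
probe() {
    ping -c 1 -W 2 "$1" >/dev/null 2>&1; ping_status=$?
    nc -z -w 2 "$1" "$2" >/dev/null 2>&1; port_status=$?
    diagnose "$ping_status" "$port_status"
}
```

Run from a failing slave, `probe its-cs131 35554` would test the server and port from the "Retrying connect" log lines. Note that a closed port looks the same whether a firewall drops the traffic or the master process has simply stopped, so it is worth checking on the master that the daemon is still running.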
-
Re: DataNode and Tasttracker communication
Björn-Elmar Macek 2012-08-16, 13:17
Hello again,

I have ruled out pretty much all of my suspicions that the communication problems are related to the infrastructure. Instead, in a new run of my program I found that, for reasons I could not track down, the namenode and the tasktracker stop their services after too many map tasks have failed. See the logs below.

From that time on, of course, the running datanodes/tasktrackers cannot communicate with the jobtracker/namenode. What I do not understand is why the jobs stop responding or fail. I wanted to look it up in the logs, but somehow they do not contain anything from before midnight, a time at which the master(s) had already been dead for 2 hours.

Are there any suggestions? Did I maybe do something wrong in the Mapper?

Regards,
Elmar

#################################### JOBLOG ... LAST LINES #################################
Task attempt_201208152128_0001_m_000007_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:10 INFO mapred.JobClient: map 50% reduce 0%
12/08/15 21:50:12 INFO mapred.JobClient: map 39% reduce 0%
12/08/15 21:50:13 INFO mapred.JobClient: map 23% reduce 0%
12/08/15 21:50:14 INFO mapred.JobClient: map 19% reduce 0%
12/08/15 21:50:15 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000014_1, Status : FAILED
Task attempt_201208152128_0001_m_000014_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:15 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000015_1, Status : FAILED
Task attempt_201208152128_0001_m_000015_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:17 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000004_1, Status : FAILED
Task attempt_201208152128_0001_m_000004_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:17 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000005_1, Status : FAILED
Task attempt_201208152128_0001_m_000005_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:17 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000012_1, Status : FAILED
Task attempt_201208152128_0001_m_000012_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:18 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000008_1, Status : FAILED
Task attempt_201208152128_0001_m_000008_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:19 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000009_1, Status : FAILED
Task attempt_201208152128_0001_m_000009_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:19 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000000_1, Status : FAILED
Task attempt_201208152128_0001_m_000000_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:19 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000010_1, Status : FAILED
Task attempt_201208152128_0001_m_000010_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:20 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000002_1, Status : FAILED
Task attempt_201208152128_0001_m_000002_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:21 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000003_1, Status : FAILED
Task attempt_201208152128_0001_m_000003_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:22 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000013_1, Status : FAILED
Task attempt_201208152128_0001_m_000013_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:22 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000001_1, Status : FAILED
Task attempt_201208152128_0001_m_000001_1 failed to report status for 601 seconds. Killing!
12/08/15 21:50:23 INFO mapred.JobClient: map 11% reduce 0%
12/08/15 21:50:25 INFO mapred.JobClient: map 17% reduce 0%
12/08/15 21:50:27 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000006_1, Status : FAILED
Task attempt_201208152128_0001_m_000006_1 failed to report status for 602 seconds. Killing!
12/08/15 21:50:27 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000013_2, Status : FAILED
Task attempt_201208152128_0001_m_000013_2 failed to report status for 602 seconds. Killing!
12/08/15 21:50:27 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000001_2, Status : FAILED
Task attempt_201208152128_0001_m_000001_2 failed to report status for 602 seconds. Killing!
12/08/15 21:50:28 INFO mapred.JobClient: map 40% reduce 0%
12/08/15 21:50:29 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000011_2, Status : FAILED
Task attempt_201208152128_0001_m_000011_2 failed to report status for 601 seconds. Killing!
12/08/15 21:50:30 INFO mapred.JobClient: map 42% reduce 0%
12/08/15 21:50:31 INFO mapred.JobClient: map 52% reduce 0%
12/08/15 21:50:33 INFO mapred.JobClient: map 54% reduce 0%
12/08/15 21:50:37 INFO mapred.JobClient: map 58% reduce 0%
12/08/15 21:50:39 INFO mapred.JobClient: map 61% reduce 0%
12/08/15 21:50:42 INFO mapred.JobClient: map 62% reduce 0%
12/08/15 21:50:46 INFO mapred.JobClient: map 58% reduce 0%
12/08/15 21:50:55 INFO mapred.JobClient: map 54% reduce 0%
12/08/15 21:50:57 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000011_1, Status : FAILED
Task attempt_201208152128_0001_m_000011_1 failed to report status for 602 seconds. Killing!
12/08/15 21:52:10 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000006_2, Status : FAILED
Task attempt_201208152128_0001_m_000006_2 failed to report status for 602 seconds. Killing!
12/08/15 22:00:25 INFO mapred.JobClient: Task Id : attempt_201208152128_0001_m_000007_2, Status : FAILED
Task attempt_201208152128_0001_m_000007_2 failed to report status for 602 seconds. Killing!
12/08/15 22:00:29 INFO mapred.JobClient: map 50% reduce 0%
12/08/15 22:00:32 INFO mapred.JobClient: map 46% reduce 0%
12/08/15 22:00:34 INFO mapred.Job
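The "failed to report status for 601 seconds" lines are what Hadoop prints when a task has not reported progress within the task timeout (mapred.task.timeout, 600000 ms by default). To see whether all map tasks are affected or only some, the killed attempts can be pulled out of the JobClient output. A minimal sketch; the function name timed_out_attempts is made up for illustration:

```sh
#!/bin/sh
# timed_out_attempts (hypothetical helper)
# Reads JobClient output on stdin and prints "<attempt id> <seconds>" once
# for every attempt that was killed for not reporting status.
timed_out_attempts() {
    sed -n 's/.*Task \(attempt_[0-9]*_[0-9]*_[mr]_[0-9]*_[0-9]*\) failed to report status for \([0-9]*\) seconds.*/\1 \2/p' | sort -u
}
```

Piping the saved job output through it (for example `timed_out_attempts < job.log`, where job.log is an assumed file name) gives one line per killed attempt. If every map task shows up, the mapper itself probably runs too long between status reports, rather than a single node being at fault.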
-
Re: DataNode and Tasttracker communication
Björn-Elmar Macek 2012-08-20, 10:15
OK, here is what I did to solve the namespace errors on the datanodes, the startup problem, and the communication problem between datanodes/tasktrackers and namenode/jobtracker.

As you can read on several sites, there are 2 strategies for fixing datanode namespaces. Since I like to delete old stuff, because that seems more reliable to me, I wrote this script. It can be called at any time to fix the namespaces in an arbitrarily complex environment. Note that it deletes the datanode directories and reformats the namenode, so everything stored in HDFS is lost.

############ SCRIPT OVER HERE##########
#!/bin/sh
# Stop all Hadoop daemons before touching the data directories.
~/hadoop-1.0.2/bin/stop-all.sh
# Rebuild the helper script from scratch.
rm -f curclean.sh
sleep 3
echo "#!/bin/sh" > curclean.sh
# One ssh command per slave that deletes its datanode directory.
while read line
do
echo "ssh '$line' 'rm -rf /home/work/bmacek/hadoop/hdfs/slave'" >> curclean.sh
done < "/home/fb16/bmacek/hadoop-1.0.2/conf/slaves"
# Run the generated file through sh, which does not need the execute bit.
sh ./curclean.sh
sleep 3
# Reformat the namenode; conf/namenode holds its host name.
ssh "$(cat ~/hadoop-1.0.2/conf/namenode)" "~/hadoop-1.0.2/bin/hadoop namenode -format"
#####################################
!!! WARNING ADAPT PATHS !!!

The next two problems could be avoided by setting the following properties in mapred-site.xml:

############## FIX PORT PROBLEMS FOR SLAVES #############
<property>
<name>mapred.task.tracker.http.address</name>
<value>0.0.0.0:0</value>
</property>
<property>
<name>dfs.datanode.port</name>
<value>0</value>
</property>

For people who are working with huge data, I strongly recommend using:

<property>
<name>mapred.task.timeout</name>
<value>0</value>
</property>

A value of 0 disables the task timeout. Otherwise tasks that go longer than the timeout (600 seconds by default) without reporting status are killed, and your job might fail for reasons that you don't want to influence the job execution.

So much from me ... for now. ;)
Best regards, and thanks for having a look into my problems here and there.
Björn
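A variant of the cleanup loop that runs ssh directly instead of generating a second script might look like the sketch below. The function name clean_slaves and the SSH override are assumptions made for illustration. The -n flag matters here: without it, ssh reads from the loop's stdin and swallows the remaining lines of the slaves file.

```sh
#!/bin/sh
# clean_slaves SLAVES_FILE DATA_DIR (hypothetical helper)
# Deletes DATA_DIR on every host listed in SLAVES_FILE (one host per line).
# The SSH variable can name a replacement command, e.g. for a dry run;
# it defaults to ssh.
clean_slaves() {
    while read -r host; do
        if [ -n "$host" ]; then
            # -n keeps ssh from consuming the rest of the slaves file.
            ${SSH:-ssh} -n "$host" "rm -rf '$2'"
        fi
    done < "$1"
}
```

With the paths from the script above, the call would be `clean_slaves ~/hadoop-1.0.2/conf/slaves /home/work/bmacek/hadoop/hdfs/slave`. As with the original script, this irreversibly deletes the datanode data, so the paths need to be adapted and double-checked first.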
|