-
We are looking to the root of the problem that caused us IOException
Guy Doulberg 2011-04-05, 06:53

Hey guys,

We are trying to figure out why many of our Map/Reduce jobs on the cluster are failing. In the logs of the failing jobs we are getting this message:

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File **a filename*** could only be replicated to 0 nodes, instead of 1
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1282)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:469)
    at sun.reflect.GeneratedMethodAccessor29.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:512)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:968)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:964)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:962)

    at org.apache.hadoop.ipc.Client.call(Client.java:818)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:221)
    at $Proxy1.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2932)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2807)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2087)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2274)

Where should we look? What are the candidates for the root cause of this message?

Thanks,
Guy
-
Re: We are looking to the root of the problem that caused us IOException
elton sky 2011-04-05, 07:17

Check the FAQ (http://wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F)
-
RE: We are looking to the root of the problem that caused us IOException
Guy Doulberg 2011-04-05, 09:54

Thanks,

We think the problem is that we have an unbalanced HDFS cluster: some of the data nodes are more than 90% full and some are less than 30% full. This happened because the nodes with free space are newer.

We think that when a task tracker gets a task, it tries to write its map output first to its local data node, and since many of the nodes are full, the task tracker fails.

Does this diagnosis sound logical? Are there workarounds?

We are running the balancer, but it takes a lot of time, and in the meantime the cluster is not working.

We are using CDH2 from Cloudera.

Thanks
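The imbalance described in this message can be made concrete with a small calculation. The sketch below follows the criterion the HDFS balancer is documented to use (a datanode counts as over- or under-utilized when its utilization differs from the cluster-wide utilization by more than a threshold percentage). The node names and sizes are hypothetical and are not taken from the thread; real figures would come from the namenode's own report.

```python
# Hypothetical per-datanode figures as (used GB, capacity GB); not from the thread.
datanodes = {
    "dn-old-1": (920, 1000),
    "dn-old-2": (950, 1000),
    "dn-new-1": (250, 1000),
    "dn-new-2": (280, 1000),
}


def classify(nodes, threshold_pct=10.0):
    """Split nodes into over-/under-utilized relative to the cluster average."""
    total_used = sum(used for used, _ in nodes.values())
    total_cap = sum(cap for _, cap in nodes.values())
    cluster_pct = 100.0 * total_used / total_cap

    over, under = [], []
    for name, (used, cap) in sorted(nodes.items()):
        node_pct = 100.0 * used / cap
        if node_pct > cluster_pct + threshold_pct:
            over.append(name)   # candidate source of blocks to move away
        elif node_pct < cluster_pct - threshold_pct:
            under.append(name)  # candidate destination for moved blocks
    return cluster_pct, over, under


cluster_pct, over, under = classify(datanodes)
print(f"cluster utilization: {cluster_pct:.1f}%")
print("over-utilized:", over)
print("under-utilized:", under)
```

With these made-up numbers the older nodes land in the over-utilized group and the newer ones in the under-utilized group, which matches the situation Guy describes.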
-
Re: We are looking to the root of the problem that caused us IOException
Ted Dunning 2011-04-05, 15:45

You can configure the balancer to use higher bandwidth. That can speed it up by 10x.

On Tue, Apr 5, 2011 at 2:54 AM, Guy Doulberg <[EMAIL PROTECTED]> wrote:
> We are running the blancer, but it takes a lot of time... in this time the
> cluster not working
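To see why the bandwidth setting matters so much, here is a rough back-of-the-envelope estimate. It assumes the per-datanode balancer limit is the `dfs.balance.bandwidthPerSec` property in hdfs-site.xml with a default of 1048576 bytes per second, which is my understanding for Hadoop releases of this era; check the defaults shipped with your distribution. The 300 GB figure is hypothetical, and the estimate ignores everything except the bandwidth cap.

```python
# Assumed default for dfs.balance.bandwidthPerSec (bytes per second per datanode).
DEFAULT_BANDWIDTH = 1 * 1024 * 1024


def hours_to_move(gigabytes, bandwidth_bytes_per_sec):
    """Lower bound on the time one datanode needs to shed `gigabytes` of blocks."""
    total_bytes = gigabytes * 1024 ** 3
    return total_bytes / bandwidth_bytes_per_sec / 3600.0


to_move_gb = 300  # hypothetical amount an over-full node must give away

slow = hours_to_move(to_move_gb, DEFAULT_BANDWIDTH)
fast = hours_to_move(to_move_gb, 10 * DEFAULT_BANDWIDTH)
print(f"at  1 MB/s: {slow:.1f} hours")
print(f"at 10 MB/s: {fast:.1f} hours")
```

Under these assumptions, raising the limit tenfold cuts the lower bound tenfold, which is consistent with the "10x" mentioned above. Depending on the release, the datanodes may need a restart before a changed value takes effect.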
-
RE: We are looking to the root of the problem that caused us IOException
Guy Doulberg 2011-04-06, 06:34

Thanks,

That is what we actually did, and it worked! We are back on track...

Do you think we should always run the balancer, with low bandwidth, and not only after adding new nodes?
-
Re: We are looking to the root of the problem that caused us IOException
Ted Dunning 2011-04-06, 07:55

Yes. At least periodically.

You now have a situation where the age distribution of blocks in each datanode is quite different. This will lead to different evolution of which files are retained and that is likely to cause imbalances again.

It will also cause the performance of your system to be degraded since any given program probably will have a non-uniform distribution of content in your cluster.

Over time, this effect will probably decrease unless you keep your old files forever.

On Tue, Apr 5, 2011 at 11:34 PM, Guy Doulberg <[EMAIL PROTECTED]> wrote:
> Do you think we should always run the balance, with low bandwidth, and not
> only after adding new nodes?
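One way to act on the "at least periodically" advice is a small wrapper that a scheduler such as cron could call. This is only a sketch: it assumes the `hadoop` launcher is on the PATH and that the balancer is started with `hadoop balancer -threshold <percent>`, as in Hadoop releases of this era. The function names and the dry-run default are my own choices for illustration.

```python
import subprocess


def balancer_command(threshold_pct=10):
    """Build the balancer command line for the given utilization threshold."""
    return ["hadoop", "balancer", "-threshold", str(threshold_pct)]


def run_balancer(threshold_pct=10, dry_run=True):
    """Run the balancer, or just return the command string when dry_run is set."""
    cmd = balancer_command(threshold_pct)
    if dry_run:
        return " ".join(cmd)
    # Blocks until the balancer exits and returns its exit status.
    return subprocess.call(cmd)


print(run_balancer(threshold_pct=10))
```

Calling `run_balancer(dry_run=False)` from a nightly or weekly job would keep the balancer running regularly at whatever bandwidth the cluster is configured for.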
-
RE: We are looking to the root of the problem that caused us IOException
Guy Doulberg 2011-04-06, 08:00

Great,
Thanks