|
Eran Kutner
2011-04-10, 15:07
Stack
2011-04-10, 18:54
Eran Kutner
2011-04-10, 19:05
Stack
2011-04-10, 19:20
Jean-Daniel Cryans
2011-04-10, 19:27
Eran Kutner
2011-04-11, 06:30
Stack
2011-04-11, 18:00
Eran Kutner
2011-04-11, 18:22
Jean-Daniel Cryans
2011-04-11, 18:47
Jean-Daniel Cryans
2011-04-11, 20:53
Eran Kutner
2011-04-14, 06:53
|
-
Cluster crashEran Kutner 2011-04-10, 15:07
Hi,
While doing load testing on HBase the entire cluster crashed with errors like these in hbase logs: 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1213779416283711358_54194 bad datanode[0] 10.1.104.1:50010 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1213779416283711358_54194 in pipeline 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode 10.1.104.1:50010 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #2 from primary datanode 10.1.104.2:50010 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block (=blk_1213779416283711358_54194) not found at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099) at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703) at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1104) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy4.commitBlockSynchronization(Unknown Source) at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847) at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828) at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1104) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy8.recoverBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461) 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1213779416283711358_54194 failed because recovery from primary datanode 10.1.104.2:50010 failed 3 times. Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010. Will retry... 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1213779416283711358_54194 bad datanode[0] 10.1.104.1:50010 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_1213779416283711358_54194 in pipeline 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode 10.1.104.1:50010 2011-04-10 10:14:31,905 WARN org.apache.hadoop.hdfs.DFSClient: Failed recovery attempt #3 from primary datanode 10.1.104.2:50010 org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block (=blk_1213779416283711358_54194) not found at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099) at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703) at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) at org.apache.hadoop.ipc.Client.call(Client.java:1104) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) at $Proxy4.commitBlockSynchronization(Unknown Source) at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847) at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828) at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924) at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557)
-
Re: Cluster crashStack 2011-04-10, 18:54
Did you read the requirements section [1] and verify that indeed
ulimit and nprocs for the user who owns hbase and hadoop processes has indeed the upped limits? Yours, St.Ack 1. http://hbase.apache.org/book/notsoquick.html#requirements On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <[EMAIL PROTECTED]> wrote: > Hi, > While doing load testing on HBase the entire cluster crashed with > errors like these in hbase logs: > > 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 bad datanode[0] > 10.1.104.1:50010 > 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 in pipeline > 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode > 10.1.104.1:50010 > 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed > recovery attempt #2 from primary datanode 10.1.104.2:50010 > org.apache.hadoop.ipc.RemoteException: > org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block > (=blk_1213779416283711358_54194) not found > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099) > at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703) > at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) > > at org.apache.hadoop.ipc.Client.call(Client.java:1104) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) > at $Proxy4.commitBlockSynchronization(Unknown Source) > at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847) > at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828) > at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924) > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) > > at org.apache.hadoop.ipc.Client.call(Client.java:1104) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) > at $Proxy8.recoverBlock(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461) > 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 failed because > recovery from primary datanode 10.1.104.2:50010 failed 3 times. > Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010.
-
Re: Cluster crashEran Kutner 2011-04-10, 19:05
This is how it's configured in /etc/security/limits.con on all the
slaves in the cluster: hadoop - nofile 32768 hdfs - nofile 32768 hbase - nofile 32768 hadoop - nproc 32000 hdfs - nproc 32000 hbase - nproc 32000 When hbase is loading it prints: ulimit -n 32768 -eran On Sun, Apr 10, 2011 at 21:54, Stack <[EMAIL PROTECTED]> wrote: > Did you read the requirements section [1] and verify that indeed > ulimit and nprocs for the user who owns hbase and hadoop processes has > indeed the upped limits? > > Yours, > St.Ack > > 1. http://hbase.apache.org/book/notsoquick.html#requirements > > On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <[EMAIL PROTECTED]> wrote: >> Hi, >> While doing load testing on HBase the entire cluster crashed with >> errors like these in hbase logs: >> >> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error >> Recovery for block blk_1213779416283711358_54194 bad datanode[0] >> 10.1.104.1:50010 >> 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error >> Recovery for block blk_1213779416283711358_54194 in pipeline >> 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode >> 10.1.104.1:50010 >> 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed >> recovery attempt #2 from primary datanode 10.1.104.2:50010 >> org.apache.hadoop.ipc.RemoteException: >> org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block >> (=blk_1213779416283711358_54194) not found >> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099) >> at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703) >> at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) >> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) >> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) >> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:396) >> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) >> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) >> >> at org.apache.hadoop.ipc.Client.call(Client.java:1104) >> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) >> at $Proxy4.commitBlockSynchronization(Unknown Source) >> at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847) >> at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828) >> at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924) >> at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) >> at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) >> at java.lang.reflect.Method.invoke(Method.java:597) >> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) >> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) >> at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) >> at java.security.AccessController.doPrivileged(Native Method) >> at javax.security.auth.Subject.doAs(Subject.java:396) >> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) >> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) >> >> at org.apache.hadoop.ipc.Client.call(Client.java:1104) >> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) >> at $Proxy8.recoverBlock(Unknown Source)
-
Re: Cluster crashStack 2011-04-10, 19:20
What versions of the software? I see you searched the block in DN.
Whats it say if you search it in NN logs? Thanks, St.Ack On Sun, Apr 10, 2011 at 8:07 AM, Eran Kutner <[EMAIL PROTECTED]> wrote: > Hi, > While doing load testing on HBase the entire cluster crashed with > errors like these in hbase logs: > > 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 bad datanode[0] > 10.1.104.1:50010 > 2011-04-10 10:14:30,844 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 in pipeline > 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010: bad datanode > 10.1.104.1:50010 > 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Failed > recovery attempt #2 from primary datanode 10.1.104.2:50010 > org.apache.hadoop.ipc.RemoteException: > org.apache.hadoop.ipc.RemoteException: java.io.IOException: Block > (=blk_1213779416283711358_54194) not found > at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.commitBlockSynchronization(FSNamesystem.java:2099) > at org.apache.hadoop.hdfs.server.namenode.NameNode.commitBlockSynchronization(NameNode.java:703) > at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) > > at org.apache.hadoop.ipc.Client.call(Client.java:1104) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) > at $Proxy4.commitBlockSynchronization(Unknown Source) > at org.apache.hadoop.hdfs.server.datanode.DataNode.syncBlock(DataNode.java:1847) > at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1828) > at org.apache.hadoop.hdfs.server.datanode.DataNode.recoverBlock(DataNode.java:1924) > at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597) > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:557) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1416) > at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1412) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:396) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1115) > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1410) > > at org.apache.hadoop.ipc.Client.call(Client.java:1104) > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:226) > at $Proxy8.recoverBlock(Unknown Source) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2785) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$1600(DFSClient.java:2289) > at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2461) > 2011-04-10 10:14:30,880 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 failed because > recovery from primary datanode 10.1.104.2:50010 failed 3 times. > Pipeline was 10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010. > Will retry... > 2011-04-10 10:14:31,882 WARN org.apache.hadoop.hdfs.DFSClient: Error > Recovery for block blk_1213779416283711358_54194 bad datanode[0]
-
Re: Cluster crashJean-Daniel Cryans 2011-04-10, 19:27
My experience in debugging those kind of issues is that 95% of the
time it's a configuration issue, 4.99% of the time it's environment and network issues (network splits, lost packets, etc), and the remaining 0.01% is actual HDFS issues. The fact that you're saying that you had issues even with no load makes me think it's a configuration issue. Can we see your hdfs config? BTW the HBase log was pointing at 10.1.104.1 as the one having an issue, is that the log we are looking at? (it doesn't seem so) Thx, J-D On Sun, Apr 10, 2011 at 12:05 PM, Eran Kutner <[EMAIL PROTECTED]> wrote: > This is how it's configured in /etc/security/limits.con on all the > slaves in the cluster: > hadoop - nofile 32768 > hdfs - nofile 32768 > hbase - nofile 32768 > hadoop - nproc 32000 > hdfs - nproc 32000 > hbase - nproc 32000 > > When hbase is loading it prints: > ulimit -n 32768 > > > -eran
-
Re: Cluster crashEran Kutner 2011-04-11, 06:30
Hi St.Ack and J-D,
Thanks for looking into this. It can definitely be a configuration problem, but I seriously doubt it is a network or infrastructure problem. It's our own operated infrastructure (not a cloud) and we have a lot of other services running on it without any problem. Note that Hbase is complaining about multiple data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5), I attached the logs from just one of them but it's more or less the same on all. Please see the NN log for the same block below. We are using Hadoop 0.20.2-CDH3B4 and Hbase Version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT 2011 (I built that from the 0.90.2 branch to try to fix a problem with replication) This is the HDFS config file: # cat /etc/hadoop/conf/hdfs-site.xml <?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>dfs.replication</name> <value>3</value> </property> <property> <name>dfs.permissions</name> <value>false</value> </property> <property> <!-- specify this so that running 'hadoop namenode -format' formats the right dir --> <name>dfs.name.dir</name> <value>/hadoop/dfs/name,/hadoop/dfs/name2</value> </property> <property> <name>dfs.data.dir</name> <value>/hadoop/dfs/data/</value> </property> <property> <name>fs.checkpoint.dir</name> <value>/hadoop/dfs/namesecondary/</value> </property> <property> <name>dfs.datanode.max.xcievers</name> <value>4096</value> </property> </configuration> This is the NN log for the same block: 2011-04-10 07:29:23,835 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.allocateBlock: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279. blk_1213779416283711358_54194 2011-04-10 10:12:55,749 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* blk_1213779416283711358_54194 recovery started, primary=10.1.104.1:50010 2011-04-10 10:12:58,292 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: Targets updated: block blk_1213779416283711358_54249 on 10.1.104.1:50010 is added as a target for block blk_1213779416283711358_54194 with size 162696 2011-04-10 10:12:58,293 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: Targets updated: block blk_1213779416283711358_54249 on 10.1.104.5:50010 is added as a target for block blk_1213779416283711358_54194 with size 162696 2011-04-10 10:12:58,294 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: Targets updated: block blk_1213779416283711358_54249 on 10.1.104.2:50010 is added as a target for block blk_1213779416283711358_54194 with size 162696 2011-04-10 10:12:58,295 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_1213779416283711358_54194, newgenerationstamp=54249, newlength=162696, newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010], closeFile=true, deleteBlock=false) 2011-04-10 10:12:58,340 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(newblock=blk_1213779416283711358_54249, file=/hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279, newgenerationstamp=54249, newlength=162696, newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010]) successful 2011-04-10 10:12:58,773 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(lastblock=blk_1213779416283711358_54249, newgenerationstamp=54250, newlength=162696, newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010], closeFile=false, deleteBlock=false) 2011-04-10 10:12:58,798 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: commitBlockSynchronization(blk_1213779416283711358_54250) successful 2011-04-10 10:12:58,833 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010 is added to blk_1213779416283711358_54250 size 162696 2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010 is added to blk_1213779416283711358_54250 size 162696 2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.1:50010 is added to blk_1213779416283711358_54250 size 162696 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added to invalidSet of 10.1.104.2:50010 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added to invalidSet of 10.1.104.5:50010 2011-04-10 10:13:28,277 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_1213779416283711358 is added to invalidSet of 10.1.104.1:50010 2011-04-10 10:13:29,811 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.2:50010 to delete blk_-826484977486799080_54240 blk_-3859382357539098123_54247 blk_6239319017569949674_54242 blk_2498532690928849052_54243 blk_-4133675075382826161_54234 blk_-6299622736350943739_54246 blk_1213779416283711358_54250 blk_-7407505773873780456_54244 blk_1070089292787245089_54237 2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.1:50010 to delete blk_-826484977486799080_54240 blk_-9184330013088281910_54231 blk_842002398362359267_54239 blk_6239319017569949674_54242 blk_-4133675075382826161_54234 blk_-2420408100786396293_54245 blk_1213779416283711358_54250 blk_1475680055217098335_54241 blk_-8623426430834046353_54236 blk_-3859382357539098123_54247 blk_2498532690928849052_54243 blk_-6299622736350943739_54246 blk_-7407505773873780456_54244 blk_1070089292787245089_54237 2011-04-10 10:13:32,813 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* ask 10.1.104.5:50010 to d
-
Re: Cluster crashStack 2011-04-11, 18:00
On Sun, Apr 10, 2011 at 11:30 PM, Eran Kutner <[EMAIL PROTECTED]> wrote:
> Hi St.Ack and J-D, > Thanks for looking into this. > > It can definitely be a configuration problem, but I seriously doubt it > is a network or infrastructure problem. It's our own operated > infrastructure (not a cloud) and we have a lot of other services > running on it without any problem. Services that could be stealing i/o and cpu from hbase cluster? Is that possible? > Note that Hbase is complaining > about multiple data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5), I > attached the logs from just one of them but it's more or less the same > on all. Please see the NN log for the same block below. > An attachment? Did it come through? Perhaps pastebin it and then add link here? > We are using Hadoop 0.20.2-CDH3B4 > and Hbase Version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT > 2011 (I built that from the 0.90.2 branch to try to fix a problem with > replication) > > </configuration> > Your configuration looks fine. Can you associate the block with a file? I don't see the association in the below. I see us trying to delete the block (would like to know why -- file deleteion?) and then it does exist on .2 for whatever reason. St.Ack > > This is the NN log for the same block: > 2011-04-10 07:29:23,835 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.allocateBlock: > /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279. > blk_1213779416283711358_54194 > 2011-04-10 10:12:55,749 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* blk_1213779416283711358_54194 recovery started, > primary=10.1.104.1:50010 > 2011-04-10 10:12:58,292 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.addStoredBlock: Targets updated: block > blk_1213779416283711358_54249 on 10.1.104.1:50010 is added as a target > for block blk_1213779416283711358_54194 with size 162696 > 2011-04-10 10:12:58,293 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.addStoredBlock: Targets updated: block > blk_1213779416283711358_54249 on 10.1.104.5:50010 is added as a target > for block blk_1213779416283711358_54194 with size 162696 > 2011-04-10 10:12:58,294 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.addStoredBlock: Targets updated: block > blk_1213779416283711358_54249 on 10.1.104.2:50010 is added as a target > for block blk_1213779416283711358_54194 with size 162696 > 2011-04-10 10:12:58,295 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194, > newgenerationstamp=54249, newlength=162696, > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010], > closeFile=true, deleteBlock=false) > 2011-04-10 10:12:58,340 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(newblock=blk_1213779416283711358_54249, > file=/hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279, > newgenerationstamp=54249, newlength=162696, > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010]) > successful > 2011-04-10 10:12:58,773 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(lastblock=blk_1213779416283711358_54249, > newgenerationstamp=54250, newlength=162696, > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010], > closeFile=false, deleteBlock=false) > 2011-04-10 10:12:58,798 INFO > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > commitBlockSynchronization(blk_1213779416283711358_54250) successful > 2011-04-10 10:12:58,833 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.2:50010 > is added to blk_1213779416283711358_54250 size 162696 > 2011-04-10 10:12:58,834 INFO org.apache.hadoop.hdfs.StateChange: > BLOCK* NameSystem.addStoredBlock: blockMap updated: 10.1.104.5:50010 > is added to blk_1213779416283711358_54250 size 162696
-
Re: Cluster crashEran Kutner 2011-04-11, 18:22
There wasn't an attachment, I pasted all the lines from all the NN logs that
contain that particular block number inline. As for CPU/IO, first there is nothing else running on those servers, second, CPU utilization on the slaves at peak load was around 40% and disk IO utilization less than 20%. That's the strange thing about it (I have another thread going about the performance), there is no bottleneck I could identify and yet the performance was relatively low, compared to the numbers I see quoted for HBase in other places. The first line of the NN log says: BLOCK* NameSystem.allocateBlock: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com ,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com %3A60020.1302434963279.blk_1213779416283711358_54194 So it looks like a file name is: /hbase/.logs/hadoop1-s01.farm-ny.gigya.com ,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279 Is there a better way to associate a file with a block? -eran On Mon, Apr 11, 2011 at 21:00, Stack <[EMAIL PROTECTED]> wrote: > On Sun, Apr 10, 2011 at 11:30 PM, Eran Kutner <[EMAIL PROTECTED]> wrote: > > Hi St.Ack and J-D, > > Thanks for looking into this. > > > > It can definitely be a configuration problem, but I seriously doubt it > > is a network or infrastructure problem. It's our own operated > > infrastructure (not a cloud) and we have a lot of other services > > running on it without any problem. > > Services that could be stealing i/o and cpu from hbase cluster? Is > that possible? > > > Note that Hbase is complaining > > about multiple data nodes (10.1.104.1, 10.1.104.2, 10.1.104.5), I > > attached the logs from just one of them but it's more or less the same > > on all. Please see the NN log for the same block below. > > > > An attachment? Did it come through? Perhaps pastebin it and then add link > here? > > > > We are using Hadoop 0.20.2-CDH3B4 > > and Hbase Version 0.90.2-SNAPSHOT, rUnknown, Wed Mar 23 06:09:51 EDT > > 2011 (I built that from the 0.90.2 branch to try to fix a problem with > > replication) > > > > > > > > </configuration> > > > > Your configuration looks fine. > > Can you associate the block with a file? I don't see the association > in the below. I see us trying to delete the block (would like to know > why -- file deleteion?) and then it does exist on .2 for whatever > reason. > > St.Ack > > > > > This is the NN log for the same block: > > 2011-04-10 07:29:23,835 INFO org.apache.hadoop.hdfs.StateChange: > > BLOCK* NameSystem.allocateBlock: > > /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/ > hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279. > > blk_1213779416283711358_54194 > > 2011-04-10 10:12:55,749 INFO org.apache.hadoop.hdfs.StateChange: > > BLOCK* blk_1213779416283711358_54194 recovery started, > > primary=10.1.104.1:50010 > > 2011-04-10 10:12:58,292 INFO org.apache.hadoop.hdfs.StateChange: > > BLOCK* NameSystem.addStoredBlock: Targets updated: block > > blk_1213779416283711358_54249 on 10.1.104.1:50010 is added as a target > > for block blk_1213779416283711358_54194 with size 162696 > > 2011-04-10 10:12:58,293 INFO org.apache.hadoop.hdfs.StateChange: > > BLOCK* NameSystem.addStoredBlock: Targets updated: block > > blk_1213779416283711358_54249 on 10.1.104.5:50010 is added as a target > > for block blk_1213779416283711358_54194 with size 162696 > > 2011-04-10 10:12:58,294 INFO org.apache.hadoop.hdfs.StateChange: > > BLOCK* NameSystem.addStoredBlock: Targets updated: block > > blk_1213779416283711358_54249 on 10.1.104.2:50010 is added as a target > > for block blk_1213779416283711358_54194 with size 162696 > > 2011-04-10 10:12:58,295 INFO > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem: > > commitBlockSynchronization(lastblock=blk_1213779416283711358_54194, > > newgenerationstamp=54249, newlength=162696, > > newtargets=[10.1.104.1:50010, 10.1.104.5:50010, 10.1.104.2:50010], > > closeFile=true, deleteBlock=false) > > 2011-04-10 10:12:58,340 INFO > > org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
-
Re: Cluster crashJean-Daniel Cryans 2011-04-11, 18:47
So my understanding is that this log file was opened at 7:29 and then
something happened at 10:12:55 as something triggered the recovery on that block. It triggered a recovery of the block with the new name being blk_1213779416283711358_54249 It seems that that process was started by the DFS Client at 10:12:55 but the RS log starts at 10:14. Would it be possible to see what was before that? Also it would be nice to have a view for those blocks on all the datanodes. It would be nice to do this debugging on IRC is it can require a lot of back and forth. J-D On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <eran@.com> wrote: > There wasn't an attachment, I pasted all the lines from all the NN logs that > contain that particular block number inline. > > As for CPU/IO, first there is nothing else running on those servers, second, > CPU utilization on the slaves at peak load was around 40% and disk IO > utilization less than 20%. That's the strange thing about it (I have another > thread going about the performance), there is no bottleneck I could identify > and yet the performance was relatively low, compared to the numbers I see > quoted for HBase in other places. > > The first line of the NN log says: > BLOCK* NameSystem.allocateBlock: > /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194 > So it looks like a file name is: > /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279 > > Is there a better way to associate a file with a block? > > -eran > > >
-
Re: Cluster crashJean-Daniel Cryans 2011-04-11, 20:53
Alright so I was able to get the logs from Eran, the HDFS errors are a
red herring, what followed in the region server log that is really important is: 2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 144490ms for sessionid 0x12ee42283320050, closing socket connection and attempting reconnect Which is a 2m20s GC pause. The HDFS errors come from the fact that the master split the logs _while_ the region server was sleeping. J-D On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans <[EMAIL PROTECTED]> wrote: > So my understanding is that this log file was opened at 7:29 and then > something happened at 10:12:55 as something triggered the recovery on > that block. It triggered a recovery of the block with the new name > being blk_1213779416283711358_54249 > > It seems that that process was started by the DFS Client at 10:12:55 > but the RS log starts at 10:14. Would it be possible to see what was > before that? Also it would be nice to have a view for those blocks on > all the datanodes. > > It would be nice to do this debugging on IRC is it can require a lot > of back and forth. > > J-D > > On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <eran@.com> wrote: >> There wasn't an attachment, I pasted all the lines from all the NN logs that >> contain that particular block number inline. >> >> As for CPU/IO, first there is nothing else running on those servers, second, >> CPU utilization on the slaves at peak load was around 40% and disk IO >> utilization less than 20%. That's the strange thing about it (I have another >> thread going about the performance), there is no bottleneck I could identify >> and yet the performance was relatively low, compared to the numbers I see >> quoted for HBase in other places. >> >> The first line of the NN log says: >> BLOCK* NameSystem.allocateBlock: >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279.blk_1213779416283711358_54194 >> So it looks like a file name is: >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279 >> >> Is there a better way to associate a file with a block? >> >> -eran >> >> >> >
-
Re: Cluster crashEran Kutner 2011-04-14, 06:53
After 3 days of running with the configuration changes recommended by J-D
the cluster seems stable now. For the benefit of others I would say there were two issues identified: First, the HBASE_HEAP was set too high. It turns out that each Haddop daemon takes at least 1GB at startup even if it's doing nothing. Since we have a data node, a task tracker and a thrift server running on each machine those take up 3GB or RAM that must be accounted for when allocating memory for the region server. Second, we had "-XX:+CMSIncrementalMode" configured, which apparently is not good with multi-core systems. Thanks J-D for all the help. -eran On Mon, Apr 11, 2011 at 23:53, Jean-Daniel Cryans <[EMAIL PROTECTED]>wrote: > Alright so I was able to get the logs from Eran, the HDFS errors are a > red herring, what followed in the region server log that is really > important is: > > 2011-04-10 10:14:27,278 INFO org.apache.zookeeper.ClientCnxn: Client > session timed out, have not heard from server in 144490ms for > sessionid 0x12ee42283320050, closing socket connection and attempting > reconnect > > Which is a 2m20s GC pause. The HDFS errors come from the fact that the > master split the logs _while_ the region server was sleeping. > > J-D > > On Mon, Apr 11, 2011 at 11:47 AM, Jean-Daniel Cryans > <[EMAIL PROTECTED]> wrote: > > So my understanding is that this log file was opened at 7:29 and then > > something happened at 10:12:55 as something triggered the recovery on > > that block. It triggered a recovery of the block with the new name > > being blk_1213779416283711358_54249 > > > > It seems that that process was started by the DFS Client at 10:12:55 > > but the RS log starts at 10:14. Would it be possible to see what was > > before that? Also it would be nice to have a view for those blocks on > > all the datanodes. > > > > It would be nice to do this debugging on IRC is it can require a lot > > of back and forth. > > > > J-D > > > > On Mon, Apr 11, 2011 at 11:22 AM, Eran Kutner <eran@.com> wrote: > >> There wasn't an attachment, I pasted all the lines from all the NN logs > that > >> contain that particular block number inline. > >> > >> As for CPU/IO, first there is nothing else running on those servers, > second, > >> CPU utilization on the slaves at peak load was around 40% and disk IO > >> utilization less than 20%. That's the strange thing about it (I have > another > >> thread going about the performance), there is no bottleneck I could > identify > >> and yet the performance was relatively low, compared to the numbers I > see > >> quoted for HBase in other places. > >> > >> The first line of the NN log says: > >> BLOCK* NameSystem.allocateBlock: > >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/ > hadoop1-s01.farm-ny.gigya.com > %3A60020.1302434963279.blk_1213779416283711358_54194 > >> So it looks like a file name is: > >> /hbase/.logs/hadoop1-s01.farm-ny.gigya.com,60020,1302185988579/ > hadoop1-s01.farm-ny.gigya.com%3A60020.1302434963279 > >> > >> Is there a better way to associate a file with a block? > >> > >> -eran > >> > >> > >> > > > |