|
Lin XIAO
2012-11-27, 21:28
Keith Turner
2012-11-27, 21:38
Lin XIAO
2012-11-27, 21:42
Keith Turner
2012-11-27, 22:22
John Vines
2012-11-27, 22:24
Lin XIAO
2012-11-27, 23:20
Keith Turner
2012-11-28, 13:55
Lin XIAO
2012-11-28, 15:44
Keith Turner
2012-11-28, 15:55
Lin XIAO
2012-11-28, 16:07
Keith Turner
2012-11-28, 16:08
Keith Turner
2012-11-28, 16:20
Keith Turner
2012-11-28, 16:56
Lin XIAO
2012-11-28, 19:05
Lin XIAO
2012-11-28, 19:22
Keith Turner
2012-11-29, 18:16
|
-
Table deletion got stuckLin XIAO 2012-11-27, 21:28
When deleting tables, sometimes the operation get stuck for hours with
table status "UNKNOWN" shown in the monitor page. The only relevant logs I can find are similar to this: [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: is locationState: is<<@(null,10.0.0.10:41000[43b1b039a081368],null) It happens to a few tables in our cluster and most of them are hosted by one tablet server, so I killed the tablet server to force the tablets to be reassigned to other servers. The tables got deleted successfully after that. I'm wondering what I should do to solve problems like this. Right now, we still have a bunch of tables with status "UNKNOWN". I'm wondering if I need to find the tablet servers hosting those tables and restart them. Btw, the swappiness in our cluster is set to 60 now instead of the recommended value 10. Thanks, Lin
-
Re: Table deletion got stuckKeith Turner 2012-11-27, 21:38
When this happens, what do you seen when you run the following command?
accumulo org.apache.accumulo.server.fate.Admin print Keith On Tue, Nov 27, 2012 at 4:28 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > When deleting tables, sometimes the operation get stuck for hours with > table status "UNKNOWN" shown in the monitor page. The only relevant > logs I can find are similar to this: > > [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: is > locationState: is<<@(null,10.0.0.10:41000[43b1b039a081368],null) > > It happens to a few tables in our cluster and most of them are hosted > by one tablet server, so I killed the tablet server to force the > tablets to be reassigned to other servers. The tables got deleted > successfully after that. I'm wondering what I should do to solve > problems like this. Right now, we still have a bunch of tables with > status "UNKNOWN". I'm wondering if I need to find the tablet servers > hosting those tables and restart them. > > Btw, the swappiness in our cluster is set to 60 now instead of the > recommended value 10. > > Thanks, > Lin >
-
Re: Table deletion got stuckLin XIAO 2012-11-27, 21:42
txid: 662a05cc47f39958 status: SUCCESSFUL op: TableRangeOp
locked: [] locking: [] top: null txid: 56a0bff9ec9ca336 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 3572c73dbd3aba73 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 448b068738d7af7a status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 09583b98336a3d0d status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 28608caeee4921c1 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 474eed6bb4ca7153 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 394429e4aa1a4ecd status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 5936f5303207629f status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 65d3480a2c9b8662 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 0a9e7c4eb9f12424 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 77e9bfda4189490d status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 4831c192ce2b3ecd status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 3e4a893e3d2cdfe3 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 1f4c647a48c383a6 status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 5165b610193bad9e status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 40856ee9e8ae4c93 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 7253e61081cfac4b status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 594beffb28696a1b status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 5e5fba2a68c16832 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 32301b3b6f3bb153 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 5781fcdd5a7d7336 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 5296eb970dee4d6c status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 1cff0e2db601191f status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 478bfcfc84c99bb5 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 0927034ab3b14fc7 status: SUCCESSFUL op: CreateTable locked: [] locking: [] top: null txid: 5f55fcee11ecbdae status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 4edda48ad708a822 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 17a3046a636ce5bb status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 06ddef5bc35ac902 status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 7add32014ff807a8 status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 279e92a85ecf4b11 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 77c581cad6d78b3d status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 0954597c10c1dd27 status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 5959936797c4486e status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 6d8fc72f539477d4 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 29f186318325f6b6 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 34d12ea53f5d4de6 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 309d5f1641b664bb status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 1be97c819f4ccafe status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 5a4b7d4388e72682 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 759234cac75efa7f status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 5f83019e58a7d6f1 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 0b5f7572289c9069 status: FAILED op: CreateTable locked: [] locking: [] top: null txid: 03df2e29e5e8e241 status: SUCCESSFUL op: DeleteTable locked: [] locking: [] top: null txid: 7091a54ac2e689e6 status: FAILED op: DeleteTable locked: [] locking: [] top: null txid: 44af272a6e1c9e3f status: FAILED op: CreateTab
-
Re: Table deletion got stuckKeith Turner 2012-11-27, 22:22
Look in the master logs to see what went wrong with the failed delete table
operations. On of them is 1f4c647a48c383a6, if you grep the master logs for 1f4c647a48c383a6 what do you see? On Tue, Nov 27, 2012 at 4:42 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > txid: 662a05cc47f39958 status: SUCCESSFUL op: TableRangeOp > locked: [] locking: [] top: null > txid: 56a0bff9ec9ca336 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 3572c73dbd3aba73 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 448b068738d7af7a status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 09583b98336a3d0d status: FAILED op: DeleteTable > locked: [] locking: [] top: null > txid: 28608caeee4921c1 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 474eed6bb4ca7153 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 394429e4aa1a4ecd status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 5936f5303207629f status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 65d3480a2c9b8662 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 0a9e7c4eb9f12424 status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 77e9bfda4189490d status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 4831c192ce2b3ecd status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 3e4a893e3d2cdfe3 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 1f4c647a48c383a6 status: FAILED op: DeleteTable > locked: [] locking: [] top: null > txid: 5165b610193bad9e status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 40856ee9e8ae4c93 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 7253e61081cfac4b status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 594beffb28696a1b status: SUCCESSFUL op: DeleteTable > locked: [] locking: [] top: null > txid: 5e5fba2a68c16832 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 32301b3b6f3bb153 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 5781fcdd5a7d7336 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 5296eb970dee4d6c status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 1cff0e2db601191f status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 478bfcfc84c99bb5 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 0927034ab3b14fc7 status: SUCCESSFUL op: CreateTable > locked: [] locking: [] top: null > txid: 5f55fcee11ecbdae status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 4edda48ad708a822 status: FAILED op: CreateTable > locked: [] locking: [] top: null > txid: 17a3046a636ce5bb status: FAILED op: CreateTable
-
Re: Table deletion got stuckJohn Vines 2012-11-27, 22:24
That should also pop up in the recent events in the monitor.
On Tue, Nov 27, 2012 at 5:22 PM, Keith Turner <[EMAIL PROTECTED]> wrote: > Look in the master logs to see what went wrong with the failed delete > table operations. On of them is 1f4c647a48c383a6, if you grep the master > logs for 1f4c647a48c383a6 what do you see? > > On Tue, Nov 27, 2012 at 4:42 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > >> txid: 662a05cc47f39958 status: SUCCESSFUL op: TableRangeOp >> locked: [] locking: [] top: null >> txid: 56a0bff9ec9ca336 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 3572c73dbd3aba73 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 448b068738d7af7a status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 09583b98336a3d0d status: FAILED op: DeleteTable >> locked: [] locking: [] top: null >> txid: 28608caeee4921c1 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 474eed6bb4ca7153 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 394429e4aa1a4ecd status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 5936f5303207629f status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 65d3480a2c9b8662 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 0a9e7c4eb9f12424 status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 77e9bfda4189490d status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 4831c192ce2b3ecd status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 3e4a893e3d2cdfe3 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 1f4c647a48c383a6 status: FAILED op: DeleteTable >> locked: [] locking: [] top: null >> txid: 5165b610193bad9e status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 40856ee9e8ae4c93 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 7253e61081cfac4b status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 594beffb28696a1b status: SUCCESSFUL op: DeleteTable >> locked: [] locking: [] top: null >> txid: 5e5fba2a68c16832 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 32301b3b6f3bb153 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 5781fcdd5a7d7336 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 5296eb970dee4d6c status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 1cff0e2db601191f status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 478bfcfc84c99bb5 status: FAILED op: CreateTable >> locked: [] locking: [] top: null >> txid: 0927034ab3b14fc7 status: SUCCESSFUL op: CreateTable >> locked: [] locking: [] top: null >> txid: 5f55fcee11ecbdae status: FAILED op: CreateTable >> locked: [] locking: [] top: null
-
Re: Table deletion got stuckLin XIAO 2012-11-27, 23:20
I've only went through the master log generated today for FAILED transactions.
CreateTable operations failed because the table already exist while the DeleteTable failed because the table doesn't exist. I think the user run his hadoop jobs several times with same table names. If the table cannot be deleted, the following create operations will fail. I'm not sure why he tried to delete an non-existed table though. 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, tid=1f4c647a48c383a6 ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, type:NOTFOUND, description:Table does not exists) at org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) at org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) at org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) at org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) at org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:662) 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing lock entry 1 userData 1f4c647a48c383a6 lockType WRITE 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) unlocked for write A deleteTable operation succeeded after I killed the tablet server. Here are some related logs: 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock entry 0 userData 12f1518e50918eaa lockType WRITE 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) locked for write operation: DELETE 27 11:46:32,660 [tables.TableManager] DEBUG: Transitioning state for table n8 from ONLINE to DELETING 27 11:46:32,662 [master.EventCoordinator] INFO : deleting table n8 27 11:46:32,663 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: !0;!0<<@(null,10.0.0.36:41000[33b1b27a4ad120a],10.0.0.36:41000[33b1b27a4ad120a]) 27 11:46:32,663 [tables.TableManager] DEBUG: State transition to DELETING @ WatchedEvent state:SyncConnected type:NodeDataChanged path:/accumulo/c8e02396-a69f-48be-aec2-045bbc55fa0c/tables/n8/state 27 11:46:32,663 [master.EventCoordinator] INFO : Table state in zookeeper changed for n8 to DELETING 27 11:46:32,685 [master.Master] DEBUG: Finished gathering information from 30 servers in 0.02 seconds 27 11:46:32,686 [master.Master] DEBUG: Telling 10.0.0.38:41000[23b1b03315212c6] to use loggers [10.0.0.64:11224, 10.0.0.15:11224] 27 11:46:32,686 [master.Master] DEBUG: Telling 10.0.0.36:41000[33b1b27a4ad120a] to use loggers [10.0.0.47:11224, 10.0.0.51:11224] 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: n8 locationState: n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) 27 14:26:09,101 [zookeeper.ZooReaderWriter] WARN : Error connecting to zookeeper, will retry in 250 org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /accumulo/c8e02396-a69f-48be-aec2-045bbc55fa0c/fate/tx_12f1518e50918eaa at org.apache.zookeeper.KeeperException.create(KeeperException.java:90) at org.apache.zookeeper.KeeperException.create(KeeperException.java:42) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921) at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950) at org.apache.accumulo.core.zookeeper.ZooReader.getData(ZooReader.java:42) at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.accumulo.server.zookeeper.ZooReaderWriter$1.invoke(ZooReaderWriter.java:169) at $Proxy3.getData(Unknown Source) at org.apache.accumulo.server.fate.ZooStore._getStatus(ZooStore.java:305) at org.apache.accumulo.server.fate.ZooStore.waitForStatusChange(ZooStore.java:327) at org.apache.accumulo.server.fate.Fate.waitForCompletion(Fate.java:186) at org.apache.accumulo.server.master.Master$MasterClientServiceHandler.waitForTableOperation(Master.java:1041) at sun.reflect.GeneratedMethodAccessor30.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.accumulo.cloudtrace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:59) at $Proxy4.waitForTableOperation(Unknown Source) at org.apache.accumulo.core.master.thrift.MasterClientService$Processor$waitForTableOperation.process(MasterClientService.java:1988) at org.apache.accumulo.core.master.thrift.MasterClientService$Processor.process(MasterClientService.java:1456) at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:154) at org.apache.thrift.server.TNonblockingServer$FrameBuffer.invoke(TNonblockingServer.java:631) at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:202) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) at java.lang.Thread.run(Thread.java:662) 27 16:08:04,530 [master.CoordinateRecoveryTask] WARN : Recovery of 10.0.0.10:11224:d5706280-f28b-4da0-9a64-8262b946686a failed 27 16:08:04,532 [master.CoordinateRecoveryTask] DEBUG: Starting log recovery: 10.0.0.14:11224/9887ab7c-6352-43f9-b6bd-abb901aa40b3 27 16:08:04,533 [master.CoordinateRecoveryTask] DEBUG: Starting to copy 9887ab7c-6352-43f9-b6bd-abb901aa40b3 from 10.0.0.14:11224 27 16:08:04,673 [security.Auditor] AUDIT: Using credentials !SYSTEM: deleted table n8 27 16:08:04,674 [zookeeper.DistributedReadWriteLock] DEBUG: Removing lock entry 0 userData 12f1518e50918eaa lockType WRITE 27 16:08:04,678 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) unlocked for write 27 16:08:04,678 [tabl
-
Re: Table deletion got stuckKeith Turner 2012-11-28, 13:55
Can you look at the logs for tablet server 10.0.0.10 and see what was going
on with tablet n8<<? Keith On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > I've only went through the master log generated today for FAILED > transactions. > CreateTable operations failed because the table already exist while > the DeleteTable failed because the table doesn't exist. I think the > user run his hadoop jobs several times with same table names. If the > table cannot be deleted, the following create operations will fail. > I'm not sure why he tried to delete an non-existed table though. > > 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, > tid=1f4c647a48c383a6 > ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, > type:NOTFOUND, description:Table does not exists) > at > org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) > at > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) > at > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) > at > org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) > at org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) > at > org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) > at java.lang.Thread.run(Thread.java:662) > 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing > lock entry 1 userData 1f4c647a48c383a6 lockType WRITE > 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) > unlocked for write > > A deleteTable operation succeeded after I killed the tablet server. > Here are some related logs: > > 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock > entry 0 userData 12f1518e50918eaa lockType WRITE > 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) > locked for write operation: DELETE > 27 11:46:32,660 [tables.TableManager] DEBUG: Transitioning state for > table n8 from ONLINE to DELETING > 27 11:46:32,662 [master.EventCoordinator] INFO : deleting table n8 > 27 11:46:32,663 [state.ZooTabletStateStore] DEBUG: Returning root > tablet state: !0;!0<<@(null,10.0.0.36:41000 > [33b1b27a4ad120a],10.0.0.36:41000[33b1b27a4ad120a]) > 27 11:46:32,663 [tables.TableManager] DEBUG: State transition to > DELETING @ WatchedEvent state:SyncConnected type:NodeDataChanged > path:/accumulo/c8e02396-a69f-48be-aec2-045bbc55fa0c/tables/n8/state > 27 11:46:32,663 [master.EventCoordinator] INFO : Table state in > zookeeper changed for n8 to DELETING > 27 11:46:32,685 [master.Master] DEBUG: Finished gathering information > from 30 servers in 0.02 seconds > 27 11:46:32,686 [master.Master] DEBUG: Telling > 10.0.0.38:41000[23b1b03315212c6] to use loggers [10.0.0.64:11224, > 10.0.0.15:11224] > 27 11:46:32,686 [master.Master] DEBUG: Telling > 10.0.0.36:41000[33b1b27a4ad120a] to use loggers [10.0.0.47:11224, > 10.0.0.51:11224] > 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to > be deleted: n8 locationState: > n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) > 27 14:26:09,101 [zookeeper.ZooReaderWriter] WARN : Error connecting to > zookeeper, will retry in 250 > org.apache.zookeeper.KeeperException$ConnectionLossException: > KeeperErrorCode = ConnectionLoss for > /accumulo/c8e02396-a69f-48be-aec2-045bbc55fa0c/fate/tx_12f1518e50918eaa > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:90) > at > org.apache.zookeeper.KeeperException.create(KeeperException.java:42) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:921) > at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:950) > at > org.apache.accumulo.core.zookeeper.ZooReader.getData(ZooReader.java:42) > at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) > at java.lang.reflect.Method.invoke(Method.java:597)
-
Re: Table deletion got stuckLin XIAO 2012-11-28, 15:44
n8 was an empty table created through the shell. Here are the logs on
machine 10.0.0.10 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< 27 11:52:25,221 [tabletserver.TabletServer] INFO : cloud9/10.0.0.10:41000: got assignment from master: n8<< 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] 423354 false=/default_tablet, n8< srv:lock [] 423354 false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] 423354 false=M0, n8< ~tab:~pr [] 423354 false=} 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened Thanks, Lin On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > Can you look at the logs for tablet server 10.0.0.10 and see what was going > on with tablet n8<<? > > Keith > > > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> I've only went through the master log generated today for FAILED >> transactions. >> CreateTable operations failed because the table already exist while >> the DeleteTable failed because the table doesn't exist. I think the >> user run his hadoop jobs several times with same table names. If the >> table cannot be deleted, the following create operations will fail. >> I'm not sure why he tried to delete an non-existed table though. >> >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, >> tid=1f4c647a48c383a6 >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, >> type:NOTFOUND, description:Table does not exists) >> at >> org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) >> at >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) >> at >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) >> at >> org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) >> at >> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) >> at >> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) >> at java.lang.Thread.run(Thread.java:662) >> 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing >> lock entry 1 userData 1f4c647a48c383a6 lockType WRITE >> 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) >> unlocked for write >> >> A deleteTable operation succeeded after I killed the tablet server. >> Here are some related logs: >> >> 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock >> entry 0 userData 12f1518e50918eaa lockType WRITE >> 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) >> locked for write operation: DELETE >> 27 11:46:32,660 [tables.TableManager] DEBUG: Transitioning state for >> table n8 from ONLINE to DELETING >> 27 11:46:32,662 [master.EventCoordinator] INFO : deleting table n8 >> 27 11:46:32,663 [state.ZooTabletStateStore] DEBUG: Returning root >> tablet state: >> !0;!0<<@(null,10.0.0.36:41000[33b1b27a4ad120a],10.0.0.36:41000[33b1b27a4ad120a]) >> 27 11:46:32,663 [tables.TableManager] DEBUG: State transition to >> DELETING @ WatchedEvent state:SyncConnected type:NodeDataChanged >> path:/accumulo/c8e02396-a69f-48be-aec2-045bbc55fa0c/tables/n8/state >> 27 11:46:32,663 [master.EventCoordinator] INFO : Table state in >> zookeeper changed for n8 to DELETING >> 27 11:46:32,685 [master.Master] DEBUG: Finished gathering information >> from 30 servers in 0.02 seconds >> 27 11:46:32,686 [master.Master] DEBUG: Telling >> 10.0.0.38:41000[23b1b03315212c6] to use loggers [10.0.0.64:11224, >> 10.0.0.15:11224] >> 27 11:46:32,686 [master.Master] DEBUG: Telling >> 10.0.0.36:41000[33b1b27a4ad120a] to use loggers [10.0.0.47:11224, >> 10.0.0.51:11224] >> 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to
-
Re: Table deletion got stuckKeith Turner 2012-11-28, 15:55
Are the times on the master and tablet server synched? The load of n8<< on
the tablet server seems to occur after delete is waiting for it. master.log : 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: n8 locationState: n8<<@(null,10.0.0.10:41000 [43b1b039a081368],null) tserver.log : 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: > n8 was an empty table created through the shell. Here are the logs on > machine 10.0.0.10 > > 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< > 27 11:52:25,221 [tabletserver.TabletServer] INFO : > cloud9/10.0.0.10:41000: got assignment from master: n8<< > 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< > 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< > 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< > future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] > 423354 false=/default_tablet, n8< srv:lock [] 423354 > false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] > 423354 false=M0, n8< ~tab:~pr [] 423354 false=} > 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< > 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened > > Thanks, > Lin > > On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > > Can you look at the logs for tablet server 10.0.0.10 and see what was > going > > on with tablet n8<<? > > > > Keith > > > > > > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > >> > >> I've only went through the master log generated today for FAILED > >> transactions. > >> CreateTable operations failed because the table already exist while > >> the DeleteTable failed because the table doesn't exist. I think the > >> user run his hadoop jobs several times with same table names. If the > >> table cannot be deleted, the following create operations will fail. > >> I'm not sure why he tried to delete an non-existed table though. > >> > >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, > >> tid=1f4c647a48c383a6 > >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, > >> type:NOTFOUND, description:Table does not exists) > >> at > >> > org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) > >> at > >> > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) > >> at > >> > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) > >> at > >> > org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) > >> at > >> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) > >> at > >> > org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) > >> at java.lang.Thread.run(Thread.java:662) > >> 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing > >> lock entry 1 userData 1f4c647a48c383a6 lockType WRITE > >> 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) > >> unlocked for write > >> > >> A deleteTable operation succeeded after I killed the tablet server. > >> Here are some related logs: > >> > >> 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock > >> entry 0 userData 12f1518e50918eaa lockType WRITE > >> 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) > >> locked for write operation: DELETE > >> 27 11:46:32,660 [tables.TableManager] DEBUG: Transitioning state for > >> table n8 from ONLINE to DELETING > >> 27 11:46:32,662 [master.EventCoordinator] INFO : deleting table n8 > >> 27 11:46:32,663 [state.ZooTabletStateStore] DEBUG: Returning root > >> tablet state: > >> !0;!0<<@(null,10.0.0.36:41000[33b1b27a4ad120a],10.0.0.36:41000 > [33b1b27a4ad120a]) > >> 27 11:46:32,663 [tables.TableManager] DEBUG: State transition to > >> DELETING @ WatchedEvent state:SyncConnected type:NodeDataChanged
-
Re: Table deletion got stuckLin XIAO 2012-11-28, 16:07
No. I think there were about 5 minutes delayed on the server. I didn't
realize that ntp wasn't running on the server until seeing the problems. On Wed, Nov 28, 2012 at 10:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > Are the times on the master and tablet server synched? The load of n8<< on > the tablet server seems to occur after delete is waiting for it. > > master.log : 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for > table to be deleted: n8 locationState: > n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) > tserver.log : 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading > tablet n8<< > > > On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> n8 was an empty table created through the shell. Here are the logs on >> machine 10.0.0.10 >> >> 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] INFO : >> cloud9/10.0.0.10:41000: got assignment from master: n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< >> future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] >> 423354 false=/default_tablet, n8< srv:lock [] 423354 >> false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] >> 423354 false=M0, n8< ~tab:~pr [] 423354 false=} >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< >> 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened >> >> Thanks, >> Lin >> >> On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >> > Can you look at the logs for tablet server 10.0.0.10 and see what was >> > going >> > on with tablet n8<<? >> > >> > Keith >> > >> > >> > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> >> >> I've only went through the master log generated today for FAILED >> >> transactions. >> >> CreateTable operations failed because the table already exist while >> >> the DeleteTable failed because the table doesn't exist. I think the >> >> user run his hadoop jobs several times with same table names. If the >> >> table cannot be deleted, the following create operations will fail. >> >> I'm not sure why he tried to delete an non-existed table though. >> >> >> >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, >> >> tid=1f4c647a48c383a6 >> >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, >> >> type:NOTFOUND, description:Table does not exists) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) >> >> at >> >> >> >> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) >> >> at >> >> >> >> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) >> >> at java.lang.Thread.run(Thread.java:662) >> >> 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing >> >> lock entry 1 userData 1f4c647a48c383a6 lockType WRITE >> >> 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) >> >> unlocked for write >> >> >> >> A deleteTable operation succeeded after I killed the tablet server. >> >> Here are some related logs: >> >> >> >> 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock >> >> entry 0 userData 12f1518e50918eaa lockType WRITE >> >> 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa) >> >> locked for write operation: DELETE >> >> 27 11:46:32,660 [tables.TableManager] DEBUG: Transitioning state for >> >> table n8 from ONLINE to DELETING >> >> 27 11:46:32,662 [master.EventCoordinator] INFO : deleting table n8
-
Re: Table deletion got stuckKeith Turner 2012-11-28, 16:08
27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to be
deleted: n8 locationState: n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) The delete code waits for all tablets related to a table to unload. It will wait for tablets that are assigned or hosted. An assigned tablet is one that the master has asked a tablet server to load, but the tablet server has not yet loaded. A hosted tablet is loaded. The debug message seems to indicate n8<< was hosted. If it were assigned, but not hosted I think it would print n8<<@(10.0.0.10:41000[43b1b039a081368],null,null) instead of @(null,10.0.0.10:41000[43b1b039a081368],null). So the master thought the tablet was loaded, even though the tablet load time was later. What times did the tablet server start? You can grep tablet server logs for "Instance". The Accumulo instance id is logged when a tablet server starts. Also, are there any FATAL messages in the tserver logs? Keith On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: > n8 was an empty table created through the shell. Here are the logs on > machine 10.0.0.10 > > 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< > 27 11:52:25,221 [tabletserver.TabletServer] INFO : > cloud9/10.0.0.10:41000: got assignment from master: n8<< > 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< > 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< > 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< > future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] > 423354 false=/default_tablet, n8< srv:lock [] 423354 > false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] > 423354 false=M0, n8< ~tab:~pr [] 423354 false=} > 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< > 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened > > Thanks, > Lin > > On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > > Can you look at the logs for tablet server 10.0.0.10 and see what was > going > > on with tablet n8<<? > > > > Keith > > > > > > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: > >> > >> I've only went through the master log generated today for FAILED > >> transactions. > >> CreateTable operations failed because the table already exist while > >> the DeleteTable failed because the table doesn't exist. I think the > >> user run his hadoop jobs several times with same table names. If the > >> table cannot be deleted, the following create operations will fail. > >> I'm not sure why he tried to delete an non-existed table though. > >> > >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, > >> tid=1f4c647a48c383a6 > >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, > >> type:NOTFOUND, description:Table does not exists) > >> at > >> > org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) > >> at > >> > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) > >> at > >> > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) > >> at > >> > org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) > >> at > >> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) > >> at > >> > org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) > >> at java.lang.Thread.run(Thread.java:662) > >> 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing > >> lock entry 1 userData 1f4c647a48c383a6 lockType WRITE > >> 27 04:52:16,569 [tableOps.Utils] INFO : table gf (1f4c647a48c383a6) > >> unlocked for write > >> > >> A deleteTable operation succeeded after I killed the tablet server. > >> Here are some related logs: > >> > >> 27 11:46:32,656 [zookeeper.DistributedReadWriteLock] INFO : Added lock > >> entry 0 userData 12f1518e50918eaa lockType WRITE > >> 27 11:46:32,658 [tableOps.Utils] INFO : table n8 (12f1518e50918eaa)
-
Re: Table deletion got stuckKeith Turner 2012-11-28, 16:20
I looked at the tablet server code to see what messages are logged when a
tablet is unloaded. When the unload request if received from the master, it throws a task on a thread pool to do the unload. Not until this task runs will you actually see anything in the logs. When the task runs, I think one of the following may be executed... but not all... maybe none. log.info("told to unload tablet that was not being served " + extent); log.debug("initiateClose(saveState=" + saveState + " queueMinC=" + queueMinC + " disableWrites=" + disableWrites + ") " + getExtent()); log.debug("Failed to unload tablet " + extent + "... it was alread closing or closed : " + e.getMessage()); log.error("Failed to close tablet " + extent + "... Aborting migration", e); If you are not seeing the initiateClose log message, one possibility is that another unload task was tying up the thread pool that processes unload. One common cause of this is someone deleting a table with a bad iterator. Keith On Wed, Nov 28, 2012 at 11:07 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: > No. I think there were about 5 minutes delayed on the server. I didn't > realize that ntp wasn't running on the server until seeing the > problems. > > On Wed, Nov 28, 2012 at 10:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > > Are the times on the master and tablet server synched? The load of n8<< > on > > the tablet server seems to occur after delete is waiting for it. > > > > master.log : 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for > > table to be deleted: n8 locationState: > > n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) > > tserver.log : 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading > > tablet n8<< > > > > > > On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: > >> > >> n8 was an empty table created through the shell. Here are the logs on > >> machine 10.0.0.10 > >> > >> 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< > >> 27 11:52:25,221 [tabletserver.TabletServer] INFO : > >> cloud9/10.0.0.10:41000: got assignment from master: n8<< > >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< > >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< > >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< > >> future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] > >> 423354 false=/default_tablet, n8< srv:lock [] 423354 > >> false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] > >> 423354 false=M0, n8< ~tab:~pr [] 423354 false=} > >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< > >> 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened > >> > >> Thanks, > >> Lin > >> > >> On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > >> > Can you look at the logs for tablet server 10.0.0.10 and see what was > >> > going > >> > on with tablet n8<<? > >> > > >> > Keith > >> > > >> > > >> > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> > wrote: > >> >> > >> >> I've only went through the master log generated today for FAILED > >> >> transactions. > >> >> CreateTable operations failed because the table already exist while > >> >> the DeleteTable failed because the table doesn't exist. I think the > >> >> user run his hadoop jobs several times with same table names. If the > >> >> table cannot be deleted, the following create operations will fail. > >> >> I'm not sure why he tried to delete an non-existed table though. > >> >> > >> >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, > >> >> tid=1f4c647a48c383a6 > >> >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, > >> >> type:NOTFOUND, description:Table does not exists) > >> >> at > >> >> > >> >> > org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) > >> >> at > >> >> > >> >> > org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224)
-
Re: Table deletion got stuckKeith Turner 2012-11-28, 16:56
On Tue, Nov 27, 2012 at 4:28 PM, Lin XIAO <[EMAIL PROTECTED]> wrote:
> When deleting tables, sometimes the operation get stuck for hours with > table status "UNKNOWN" shown in the monitor page. The only relevant > logs I can find are similar to this: > > [tableOps.CleanUp] DEBUG: Still waiting for table to be deleted: is > locationState: is<<@(null,10.0.0.10:41000[43b1b039a081368],null) > > It happens to a few tables in our cluster and most of them are hosted > by one tablet server, so I killed the tablet server to force the > tablets to be reassigned to other servers. The tables got deleted > In the future save the output of running jstack on the tablet server before killing it. > successfully after that. I'm wondering what I should do to solve > problems like this. Right now, we still have a bunch of tables with > status "UNKNOWN". I'm wondering if I need to find the tablet servers > hosting those tables and restart them. > > Btw, the swappiness in our cluster is set to 60 now instead of the > recommended value 10. > > Thanks, > Lin >
-
Re: Table deletion got stuckLin XIAO 2012-11-28, 19:05
The tserver started on November 20:
20 15:13:14,328 [client.ZooKeeperInstance] DEBUG: Trying to read instance id from /table/accumulo/instance_id 20 15:13:14,328 [server.Accumulo] INFO : Instance c8e02396-a69f-48be-aec2-045bbc55fa0c There was no FATAL log message. On Wed, Nov 28, 2012 at 11:08 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for table to be > deleted: n8 locationState: n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) > > The delete code waits for all tablets related to a table to unload. It will > wait for tablets that are assigned or hosted. An assigned tablet is one > that the master has asked a tablet server to load, but the tablet server has > not yet loaded. A hosted tablet is loaded. > > The debug message seems to indicate n8<< was hosted. If it were assigned, > but not hosted I think it would print > n8<<@(10.0.0.10:41000[43b1b039a081368],null,null) instead of > @(null,10.0.0.10:41000[43b1b039a081368],null). > > So the master thought the tablet was loaded, even though the tablet load > time was later. > > What times did the tablet server start? You can grep tablet server logs for > "Instance". The Accumulo instance id is logged when a tablet server starts. > Also, are there any FATAL messages in the tserver logs? > > Keith > > On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> n8 was an empty table created through the shell. Here are the logs on >> machine 10.0.0.10 >> >> 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] INFO : >> cloud9/10.0.0.10:41000: got assignment from master: n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent n8<< >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< >> future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] >> 423354 false=/default_tablet, n8< srv:lock [] 423354 >> false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] >> 423354 false=M0, n8< ~tab:~pr [] 423354 false=} >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< >> 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened >> >> Thanks, >> Lin >> >> On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >> > Can you look at the logs for tablet server 10.0.0.10 and see what was >> > going >> > on with tablet n8<<? >> > >> > Keith >> > >> > >> > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> >> >> I've only went through the master log generated today for FAILED >> >> transactions. >> >> CreateTable operations failed because the table already exist while >> >> the DeleteTable failed because the table doesn't exist. I think the >> >> user run his hadoop jobs several times with same table names. If the >> >> table cannot be deleted, the following create operations will fail. >> >> I'm not sure why he tried to delete an non-existed table though. >> >> >> >> 27 04:52:16,547 [fate.Fate] WARN : Failed to execute Repo, >> >> tid=1f4c647a48c383a6 >> >> ThriftTableOperationException(tableId:gf, tableName:, op:DELETE, >> >> type:NOTFOUND, description:Table does not exists) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.Utils.reserveTable(Utils.java:82) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:224) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.DeleteTable.isReady(DeleteTable.java:212) >> >> at >> >> >> >> org.apache.accumulo.server.master.tableOps.TraceRepo.isReady(TraceRepo.java:50) >> >> at >> >> >> >> org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:62) >> >> at >> >> >> >> org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34) >> >> at java.lang.Thread.run(Thread.java:662) >> >> 27 04:52:16,564 [zookeeper.DistributedReadWriteLock] DEBUG: Removing
-
Re: Table deletion got stuckLin XIAO 2012-11-28, 19:22
I cannot find any initiateClose message ever since 10:38 on the
tserver. What can I do to test if a tserver hangs because someone tries to deletes a table with a bad iterator? I'll save the output of running jstack next time. Thanks, Lin On Wed, Nov 28, 2012 at 11:20 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > I looked at the tablet server code to see what messages are logged when a > tablet is unloaded. When the unload request if received from the master, > it throws a task on a thread pool to do the unload. Not until this task > runs will you actually see anything in the logs. > > When the task runs, I think one of the following may be executed... but not > all... maybe none. > > log.info("told to unload tablet that was not being served " + extent); > > log.debug("initiateClose(saveState=" + saveState + " queueMinC=" + queueMinC > + " disableWrites=" + disableWrites + ") " + getExtent()); > > log.debug("Failed to unload tablet " + extent + "... it was alread closing > or closed : " + e.getMessage()); > > log.error("Failed to close tablet " + extent + "... Aborting migration", e); > > If you are not seeing the initiateClose log message, one possibility is that > another unload task was tying up the thread pool that processes unload. > One common cause of this is someone deleting a table with a bad iterator. > > Keith > > > On Wed, Nov 28, 2012 at 11:07 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> No. I think there were about 5 minutes delayed on the server. I didn't >> realize that ntp wasn't running on the server until seeing the >> problems. >> >> On Wed, Nov 28, 2012 at 10:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >> > Are the times on the master and tablet server synched? The load of n8<< >> > on >> > the tablet server seems to occur after delete is waiting for it. >> > >> > master.log : 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting for >> > table to be deleted: n8 locationState: >> > n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) >> > tserver.log : 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading >> > tablet n8<< >> > >> > >> > On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: >> >> >> >> n8 was an empty table created through the shell. Here are the logs on >> >> machine 10.0.0.10 >> >> >> >> 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet n8<< >> >> 27 11:52:25,221 [tabletserver.TabletServer] INFO : >> >> cloud9/10.0.0.10:41000: got assignment from master: n8<< >> >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: n8<< >> >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent >> >> n8<< >> >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< >> >> future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir [] >> >> 423354 false=/default_tablet, n8< srv:lock [] 423354 >> >> false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] >> >> 423354 false=M0, n8< ~tab:~pr [] 423354 false=} >> >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< >> >> 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened >> >> >> >> Thanks, >> >> Lin >> >> >> >> On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> wrote: >> >> > Can you look at the logs for tablet server 10.0.0.10 and see what was >> >> > going >> >> > on with tablet n8<<? >> >> > >> >> > Keith >> >> > >> >> > >> >> > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> >> >> > wrote: >> >> >> >> >> >> I've only went through the master log generated today for FAILED >> >> >> transactions. >> >> >> CreateTable operations failed because the table already exist while >> >> >> the DeleteTable failed because the table doesn't exist. I think the >> >> >> user run his hadoop jobs several times with same table names. If the >> >> >> table cannot be deleted, the following create operations will fail. >> >> >> I'm not sure why he tried to delete an non-existed table though. >> >> >>
-
Re: Table deletion got stuckKeith Turner 2012-11-29, 18:16
On Wed, Nov 28, 2012 at 2:22 PM, Lin XIAO <[EMAIL PROTECTED]> wrote:
> I cannot find any initiateClose message ever since 10:38 on the > tserver. What can I do to test if a tserver hangs because someone > tries to deletes a table with a bad iterator? > I think you should see an error logged by org.apache.accumulo.server.tabletserver.Compactor in the tablet server logs if an iterator throws an exception. > > I'll save the output of running jstack next time. > > Thanks, > Lin > > On Wed, Nov 28, 2012 at 11:20 AM, Keith Turner <[EMAIL PROTECTED]> wrote: > > I looked at the tablet server code to see what messages are logged when a > > tablet is unloaded. When the unload request if received from the > master, > > it throws a task on a thread pool to do the unload. Not until this task > > runs will you actually see anything in the logs. > > > > When the task runs, I think one of the following may be executed... but > not > > all... maybe none. > > > > log.info("told to unload tablet that was not being served " + extent); > > > > log.debug("initiateClose(saveState=" + saveState + " queueMinC=" + > queueMinC > > + " disableWrites=" + disableWrites + ") " + getExtent()); > > > > log.debug("Failed to unload tablet " + extent + "... it was alread > closing > > or closed : " + e.getMessage()); > > > > log.error("Failed to close tablet " + extent + "... Aborting migration", > e); > > > > If you are not seeing the initiateClose log message, one possibility is > that > > another unload task was tying up the thread pool that processes unload. > > One common cause of this is someone deleting a table with a bad iterator. > > > > Keith > > > > > > On Wed, Nov 28, 2012 at 11:07 AM, Lin XIAO <[EMAIL PROTECTED]> wrote: > >> > >> No. I think there were about 5 minutes delayed on the server. I didn't > >> realize that ntp wasn't running on the server until seeing the > >> problems. > >> > >> On Wed, Nov 28, 2012 at 10:55 AM, Keith Turner <[EMAIL PROTECTED]> > wrote: > >> > Are the times on the master and tablet server synched? The load of > n8<< > >> > on > >> > the tablet server seems to occur after delete is waiting for it. > >> > > >> > master.log : 27 11:48:04,332 [tableOps.CleanUp] DEBUG: Still waiting > for > >> > table to be deleted: n8 locationState: > >> > n8<<@(null,10.0.0.10:41000[43b1b039a081368],null) > >> > tserver.log : 27 11:52:25,220 [tabletserver.TabletServer] INFO : > Loading > >> > tablet n8<< > >> > > >> > > >> > On Wed, Nov 28, 2012 at 10:44 AM, Lin XIAO <[EMAIL PROTECTED]> > wrote: > >> >> > >> >> n8 was an empty table created through the shell. Here are the logs > on > >> >> machine 10.0.0.10 > >> >> > >> >> 27 11:52:25,220 [tabletserver.TabletServer] INFO : Loading tablet > n8<< > >> >> 27 11:52:25,221 [tabletserver.TabletServer] INFO : > >> >> cloud9/10.0.0.10:41000: got assignment from master: n8<< > >> >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: Loading extent: > n8<< > >> >> 27 11:52:25,221 [tabletserver.TabletServer] DEBUG: verifying extent > >> >> n8<< > >> >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: Looking at metadata {n8< > >> >> future:43b1b039a081368 [] 423355 false=10.0.0.10:41000, n8< srv:dir > [] > >> >> 423354 false=/default_tablet, n8< srv:lock [] 423354 > >> >> false=masters/lock/zlock-0000000184$43b1b039a08ad85, n8< srv:time [] > >> >> 423354 false=M0, n8< ~tab:~pr [] 423354 false=} > >> >> 27 11:52:25,223 [tabletserver.Tablet] DEBUG: got [] for logs for n8<< > >> >> 27 11:52:25,230 [tabletserver.Tablet] TABLET_HIST: n8<< opened > >> >> > >> >> Thanks, > >> >> Lin > >> >> > >> >> On Wed, Nov 28, 2012 at 8:55 AM, Keith Turner <[EMAIL PROTECTED]> > wrote: > >> >> > Can you look at the logs for tablet server 10.0.0.10 and see what > was > >> >> > going > >> >> > on with tablet n8<<? > >> >> > > >> >> > Keith > >> >> > > >> >> > > >> >> > On Tue, Nov 27, 2012 at 6:20 PM, Lin XIAO <[EMAIL PROTECTED]> > >> >> > wrote: > >> >> >> > >> >> >> I've only went through the master log generated today for FAILED |