-
Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Guillaume Perrot 2012-11-15, 13:21
Hi all,

We just upgraded our HBase cluster from 0.90.3 to 0.92.1, and now we have a lot of warnings like these in our clients:

2012-11-15 01:31:57,734 | WARN | <our thread> | HConnectionManager$HConnectionImplementation | Failed all from region=<our_table>,0d9750f9e22628e94dd33a78292d6201,1346224022442.a42b483bb10fbaea70f8616e7f06899c., hostname=<our_host>, port=60020
java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: Call to <our_host>:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/<some_host>:12492 remote=<our_host>]
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1557)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
    at org.apache.hadoop.hbase.client.HTable.batch(HTable.java:746)
    at org.apache.hadoop.hbase.client.HTable.get(HTable.java:715)
    at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:371)
    at <our_client_code>
Caused by: java.net.SocketTimeoutException: Call to <our_host>:60020 failed on socket timeout exception: java.net.SocketTimeoutException: 60000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=<some_host>:12492 remote=<our_host>:60020]
    at org.apache.hadoop.hbase.ipc.HBaseClient.wrapException(HBaseClient.java:949)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:922)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
    at $Proxy7.multi(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1386)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1384)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithoutRetries(HConnectionManager.java:1365)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1383)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1381)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)

We see the same thing on puts too (puts seem to use a lower value for the socket timeout):

2012-11-14 20:44:55,320 | WARN | <our_thread> | HConnectionManager$HConnectionImplementation | Failed all from region=<our_table>,,1346224022442.e90f2b7680df46d93d8ecd13eee08265., hostname=<our_server>, port=60020
java.util.concurrent.ExecutionException: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<our_server>:60020]
    at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222)
    at java.util.concurrent.FutureTask.get(FutureTask.java:83)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatchCallback(HConnectionManager.java:1557)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.processBatch(HConnectionManager.java:1409)
    at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:943)
    at org.apache.hadoop.hbase.client.HTable.doPut(HTable.java:820)
    at org.apache.hadoop.hbase.client.HTable.put(HTable.java:803)
    at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.put(HTablePool.java:402)
    at <our_client_code>
Caused by: java.net.SocketTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=<our_server>:60020]
    at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:213)
    at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:489)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupConnection(HBaseClient.java:328)
    at org.apache.hadoop.hbase.ipc.HBaseClient$Connection.setupIOstreams(HBaseClient.java:362)
    at org.apache.hadoop.hbase.ipc.HBaseClient.getConnection(HBaseClient.java:1045)
    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:897)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Invoker.invoke(WritableRpcEngine.java:150)
    at $Proxy7.multi(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1386)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3$1.call(HConnectionManager.java:1384)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithoutRetries(HConnectionManager.java:1365)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1383)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation$3.call(HConnectionManager.java:1381)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
    at java.util.concurrent.FutureTask.run(FutureTask.java:138)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.T
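One way to check whether warnings like these cluster on a single region server is to tally them by host and timeout kind. The following is only a minimal sketch: `TimeoutTally` and its methods are hypothetical helpers, not part of HBase. It assumes each warning has already been joined into one string, and it relies only on the `hostname=` field and the "N millis timeout while waiting for channel to be ready for read/connect" wording visible in the client logs above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: tallies client timeout warnings by host and timeout kind. */
final class TimeoutTally {
    // The "hostname=..." field of the "Failed all from region=..." warning.
    private static final Pattern HOST = Pattern.compile("hostname=([^,\\s]+)");
    // For example: "60000 millis timeout while waiting for channel to be ready for read".
    private static final Pattern TIMEOUT = Pattern.compile(
        "(\\d+) millis timeout while waiting for channel to be ready for (read|write|connect)");

    /** Returns counts keyed by "host kind millis", sorted by key. */
    static Map<String, Integer> tally(List<String> warnings) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        for (String warning : warnings) {
            Matcher host = HOST.matcher(warning);
            Matcher timeout = TIMEOUT.matcher(warning);
            if (!host.find() || !timeout.find()) {
                continue; // not a timeout warning of the expected shape
            }
            // Only the first timeout message in the entry is used.
            String key = host.group(1) + " " + timeout.group(2) + " " + timeout.group(1);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample entries that only mimic the shape of the real warnings.
        List<String> sample = Arrays.asList(
            "Failed all from region=t,,1., hostname=rs1, port=60020 "
                + "java.net.SocketTimeoutException: 60000 millis timeout while waiting "
                + "for channel to be ready for read.",
            "Failed all from region=t,,2., hostname=rs2, port=60020 "
                + "java.net.SocketTimeoutException: 20000 millis timeout while waiting "
                + "for channel to be ready for connect.");
        System.out.println(tally(sample));
    }
}
```

If most entries land on one host, that points at a single hot or overloaded region server rather than a cluster-wide problem; connect timeouts and read timeouts are kept apart because they fail at different stages of the call.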
-
Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Stack 2012-11-16, 00:56
On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[EMAIL PROTECTED]> wrote:
> It happens when several tables are being compacted and/or when there is
> several scanners running.

Does it happen for a particular region? Anything you can tell about the server from looking at your cluster monitoring? Is it running hot? What do the HBase regionserver stats in the UI say? Anything interesting about compaction queues or requests?

If you look at the thread dump, are all handlers occupied serving requests? Could these timed-out requests not get into the server?

> Before the timeouts, we observe an increasing CPU load on a single region
> server and if we add region servers and wait for rebalancing, we always
> have the same region server causing problems like these:
>
> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>
> With the same access patterns, we did not have this issue in HBase 0.90.3.

The above is the other side of the timeout -- the client is gone.

Can you explain the rising CPU? Is it iowait on this box because of compactions? Bad disk? Is it always the same regionserver, or does the issue move around?

Sorry for all the questions. 0.92 should be better than 0.90 generally (0.94 even better still -- can you go there?). Interesting that these issues show up post upgrade. I can't think of a reason why the different versions would bring this on...

St.Ack
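The handler question above can be checked from a thread dump of the region server JVM, for example one produced with the JDK's `jstack` tool. The sketch below is a rough, hypothetical helper (`HandlerDumpSummary` is not an HBase class). It assumes HotSpot-style dump text and handler threads named like `IPC Server handler 3 on 60020`, the name that appears in the server log quoted above, and it simply counts the reported state of each handler thread.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: counts RPC handler threads by state in a thread dump. */
final class HandlerDumpSummary {
    // Thread header such as: "IPC Server handler 3 on 60020" daemon prio=10 ...
    private static final Pattern HANDLER =
        Pattern.compile("^\"IPC Server handler \\d+ on \\d+\"");
    // State line such as:    java.lang.Thread.State: WAITING (parking)
    private static final Pattern STATE =
        Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Returns the number of handler threads per thread state, sorted by state. */
    static Map<String, Integer> countHandlerStates(String dump) {
        Map<String, Integer> counts = new TreeMap<String, Integer>();
        boolean inHandler = false;
        for (String line : dump.split("\\r?\\n")) {
            if (line.startsWith("\"")) {
                // A new thread section begins; note whether it is an RPC handler.
                inHandler = HANDLER.matcher(line).find();
                continue;
            }
            if (!inHandler) {
                continue;
            }
            Matcher state = STATE.matcher(line);
            if (state.find()) {
                Integer old = counts.get(state.group(1));
                counts.put(state.group(1), old == null ? 1 : old + 1);
                inHandler = false; // count each handler thread only once
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up two-thread dump, only to show the expected input shape.
        String dump =
            "\"IPC Server handler 0 on 60020\" daemon prio=10 tid=0x1 nid=0x1 runnable [0x1]\n"
            + "   java.lang.Thread.State: RUNNABLE\n"
            + "\n"
            + "\"IPC Server handler 1 on 60020\" daemon prio=10 tid=0x2 nid=0x2 waiting on condition [0x2]\n"
            + "   java.lang.Thread.State: WAITING (parking)\n";
        System.out.println(countHandlerStates(dump));
    }
}
```

The thread state is only a first signal: a handler parked while waiting for work is idle, whereas handlers that are runnable or blocked inside request-serving frames are busy, so the stack frames under each handler still need to be read before concluding that all handlers are occupied.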
-
Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Vincent Barat 2012-11-16, 16:14
On 16/11/12 01:56, Stack wrote:
> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[EMAIL PROTECTED]> wrote:
>> It happens when several tables are being compacted and/or when there is
>> several scanners running.
>
> It happens for a particular region? Anything you can tell about the
> server looking in your cluster monitoring? Is it running hot? What
> do the hbase regionserver stats in UI say? Anything interesting about
> compaction queues or requests?

Hi, thanks for your answer, Stack. I will take the lead on this thread from now on.

It does not happen on any particular region. Actually, things have improved now that compactions have been performed on all tables and have stopped.

Nevertheless, we face a dramatic decrease in performance (especially on random gets) across the whole cluster. Despite the fact that we doubled our number of region servers (from 8 to 16), and despite the fact that the CPU load on these region servers is only about 10% to 30%, performance is really bad: very often a slight increase in requests leads to clients blocked on requests and very long response times. It looks like a contention / deadlock somewhere in the HBase client and C code.

> If you look at the thread dump all handlers are occupied serving
> requests? These timedout requests couldn't get into the server?

We will investigate that and report back to you.

>> Before the timeouts, we observe an increasing CPU load on a single region
>> server and if we add region servers and wait for rebalancing, we always
>> have the same region server causing problems like these:
>>
>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc version=1, client version=29, methodsFingerPrint=54742778 from <ip>:45334: output error
>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1346)
>>
>> With the same access patterns, we did not have this issue in HBase 0.90.3.
>
> The above is other side of the timeout -- the client is gone.
>
> Can you explain the rising CPU?

No, there is no explanation (no high access to a given region, for example). But this specific problem went away when the compactions finished.

> Is it iowait on this box because of
> compactions? Bad disk? Always same regionserver or issue moves
> around?
>
> Sorry for all the questions. 0.92 should be better than 0.90

Our experience is currently the exact opposite: for us, 0.92 seems to be times slower than 0.90.3.

> generally (0.94 even better still -- can you go there?).

We can go to 0.94 but unfortunately, we CANNOT GO BACK (the same way we cannot go back to 0.90.3, since there is apparently a modification of the format of the ROOT table). The upgrade works, but the downgrade does not. And we are afraid of having even more "new" problems with 0.94 and being forced to roll back to 0.90.3 (with some days of data loss).

Thanks for your reply; we will continue to investigate.

> Interesting
> that these issues show up post upgrade. I can't think of a reason why
> the different versions would bring this on...
>
> St.Ack
-
Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Ted Yu 2012-11-16, 17:13
Vincent:

What's the value for hbase.regionserver.handler.count?

I assume you kept the same value as in 0.90.3.

Thanks
-
Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Vincent Barat 2012-11-16, 17:20
Hi,

Right now we are using the default value (10), as we did previously with 0.90.3. We are currently trying to increase it to 30 to see if things get better.

Thanks for your concern.

Vincent Barat
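Why the handler count could matter for timeouts can be seen with a back-of-envelope model. This is a deliberately simplified sketch, not a description of how HBase schedules calls: it assumes a burst of requests arrives at once on one region server, every request takes the same time to serve, and requests are served in waves of `handlers` at a time. The class `HandlerQueueModel`, the burst size and the service time are hypothetical; only the handler counts (10 and 30) and the 60000 ms read timeout come from the thread.

```java
/** Hypothetical back-of-envelope model of requests queuing behind RPC handlers. */
final class HandlerQueueModel {
    /** Completion time in ms of the request at 0-based position index in the burst. */
    static long completionMillis(int index, int handlers, long serviceMillis) {
        long wave = index / handlers; // full waves served before this request starts
        return (wave + 1) * serviceMillis;
    }

    /** Number of requests in the burst that complete after the client timeout. */
    static int timedOut(int burst, int handlers, long serviceMillis, long timeoutMillis) {
        int late = 0;
        for (int i = 0; i < burst; i++) {
            if (completionMillis(i, handlers, serviceMillis) > timeoutMillis) {
                late++;
            }
        }
        return late;
    }

    public static void main(String[] args) {
        // Hypothetical load: 100 simultaneous requests, 10 s each (e.g. during compactions).
        System.out.println("10 handlers: " + timedOut(100, 10, 10000L, 60000L) + " late");
        System.out.println("30 handlers: " + timedOut(100, 30, 10000L, 60000L) + " late");
    }
}
```

Under these assumptions, more handlers shorten the queue in front of each request, but they do not make an individual slow request faster, so raising the count helps only if the server has spare CPU and I/O capacity for the extra concurrent work.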
-
Re: Lots of SocketTimeoutException for gets and puts since HBase 0.92.1
Vincent Barat 2012-11-16, 19:55
Hi,

Right now we are using the default value (10), as we did previously with 0.90.3. We are currently trying to increase it to 30 to see if things get better.

Thanks for your concern.