X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-20, 16:21
Hi,
We have changed some parameters on our 16(!) region servers: 1 GB more -Xmx, more RPC handlers (from 10 to 30), a longer timeout, but nothing seems to improve the response time:
- Scans with HBase 0.92 are 3x SLOWER than with HBase 0.90.3
- A lot of simultaneous gets leads to a huge slowdown of batch put and random read response times
... despite the fact that our RS CPU load is really low (10%)
Note: we have not (yet) activated MSLAB, nor direct read on HDFS.
Any ideas, please? I'm really stuck on this issue.
Best regards,
On 16/11/12 at 20:55, Vincent Barat wrote:
> Hi,
>
> Right now (and previously with 0.90.3) we were using the default value (10).
> We are now trying to increase it to 30 to see if it is better.
>
> Thanks for your concern
>
> On 16/11/12 at 18:13, Ted Yu wrote:
>> Vincent:
>> What's the value for hbase.regionserver.handler.count ?
>>
>> I assume you keep the same value as that from 0.90.3
>>
>> Thanks
>>
>> On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>> On 16/11/12 at 01:56, Stack wrote:
>>>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[EMAIL PROTECTED]> wrote:
>>>>> It happens when several tables are being compacted and/or when there
>>>>> are several scanners running.
>>>>
>>>> It happens for a particular region? Anything you can tell about the
>>>> server looking in your cluster monitoring? Is it running hot? What
>>>> do the hbase regionserver stats in UI say? Anything interesting about
>>>> compaction queues or requests?
>>>
>>> Hi, thanks for your answer Stack. I will take the lead on this thread
>>> from now on.
>>>
>>> It does not happen on any particular region. Actually, things have got
>>> better now that compactions have been performed on all tables and have
>>> stopped.
>>>
>>> Nevertheless, we face a dramatic decrease in performance (especially on
>>> random gets) across the whole cluster:
>>>
>>> Despite the fact that we doubled our number of region servers (from 8
>>> to 16) and that the CPU load of these region servers is just about 10%
>>> to 30%, performance is really bad: very often a light increase in
>>> requests leads to clients locked on a request and very long response
>>> times. It looks like a contention / deadlock somewhere in the HBase
>>> client and C code.
>>>
>>>> If you look at the thread dump all handlers are occupied serving
>>>> requests? These timedout requests couldn't get into the server?
>>>
>>> We will investigate that and report back to you.
>>>
>>>>> Before the timeouts, we observe an increasing CPU load on a single
>>>>> region server and if we add region servers and wait for rebalancing,
>>>>> we always have the same region server causing problems like these:
>>>>>
>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>> Server Responder, call
>>>>> multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc
>>>>> version=1, client version=29, methodsFingerPrint=54742778 from
>>>>> <ip>:45334: output error
>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>> Server handler 3 on 60020 caught: java.nio.channels.ClosedChannelException
>>>>>     at sun.nio.ch.SocketChannelImpl.ensureWriteOpen(SocketChannelImpl.java:133)
>>>>>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:324)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer.channelWrite(HBaseServer.java:1653)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.processResponse(HBaseServer.java:924)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Responder.doRespond(HBaseServer.java:1003)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Call.sendResponseIfReady(HBaseServer.java:409)
>>>>>     at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Stack 2012-11-21, 05:05
On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
> We have changed some parameters on our 16(!) region servers : 1GB more -Xmx,
> more rpc handler (from 10 to 30) longer timeout, but nothing seems to
> improve the response time:

Have you taken a look at the perf chapter, Vincent: http://hbase.apache.org/book.html#performance

Did you carry forward your old hbase-default.xml or did you remove it (0.92 should have defaults in hbase-X.X.X.jar -- some defaults will have changed)?

> - Scans with HBase 0.92 are x3 SLOWER than with HBase 0.90.3

Any scan caching going on?

> - A lot of simultaneous gets lead to a huge slow down of batch put & random
> read response time

The gets are returning lots of data? (If you thread dump the server at this time -- see at top of the regionserver UI -- can you see what we are hung up on? Are all handlers occupied?)

> ... despite the fact that our RS CPU load is really low (10%)

As has been suggested earlier, perhaps up the handlers?

> Note: we have not (yet) activated MSlabs, nor direct read on HDFS.

MSLAB will help you avoid stop-the-world GCs. Direct read of HDFS should speed up random access.

St.Ack
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-21, 08:23
On 21/11/12 at 06:05, Stack wrote:
> On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>> We have changed some parameters on our 16(!) region servers : 1GB more -Xmx,
>> more rpc handler (from 10 to 30) longer timeout, but nothing seems to
>> improve the response time:
>
> You have taken a look at the perf chapter Vincent:
> http://hbase.apache.org/book.html#performance
>
> You carried forward your old hbase-default.xml or did you remove it
> (0.92 should have defaults in hbase-X.X.X.jar -- some defaults will
> have changed).

We use the new default settings for HBase, with just a few changes: more RPC handlers and a longer timeout (but the latter was a bad idea).

>> - Scans with HBase 0.92 are x3 SLOWER than with HBase 0.90.3
> Any scan caching going on?

Yes, scan caching is set between 64 and 1024 depending on the need.

>> - A lot of simultaneous gets lead to a huge slow down of batch put & random
>> read response time
> The gets are returning lots of data? (If you thread dump the server at
> this time -- see at top of the regionserver UI -- can you see what we
> are hung up on? Are all handlers occupied?).

We will check this...

>> ... despite the fact that our RS CPU load is really low (10%)
> As has been suggested earlier, perhaps up the handlers?
>
>> Note: we have not (yet) activated MSlabs, nor direct read on HDFS.
> MSlab will help you avoid stop-the-world GCs. Direct read of HDFS
> should speed up random access.

OK, I guess we will give it a try, but as a second step.
Thanks for your help.
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Alok Singh 2012-11-21, 04:53
Do your PUTs and GETs have small amounts of data? If yes, then you can increase the number of handlers. We have an 8-node cluster on 0.92.1, and these are some of the settings we changed from 0.90.4:
hbase.regionserver.handler.count = 150
hbase.hregion.max.filesize = 2147483648 (2GB)
The region servers run with a 16GB heap (-Xmx16000M).
With these settings, at peak we can handle ~2K concurrent clients.
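For readers who want to try the two properties quoted above, they would normally go into hbase-site.xml on each region server. The following is only a sketch: the helper name hbase_site_xml is made up for illustration, and the property names and values are simply the ones listed in this message, not a recommendation for every cluster.

```python
import xml.etree.ElementTree as ET


def hbase_site_xml(props):
    """Render a dict of property name -> value as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# Values quoted in the message above; tune them for your own workload.
settings = {
    "hbase.regionserver.handler.count": 150,
    "hbase.hregion.max.filesize": 2 * 1024**3,  # 2147483648 bytes = 2GB
}

print(hbase_site_xml(settings))
```

The heap size (-Xmx16000M) is a JVM option rather than an HBase property, so it is deliberately left out of the generated file.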
Alok
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-21, 09:02
Hi,
I've checked my 30 RPC handlers; they are all in a WAITING state.
Here is an extract for one of our RS (it is similar on all of them):
requestsPerSecond=593, numberOfOnlineRegions=584, numberOfStores=1147, numberOfStorefiles=1980, storefileIndexSizeMB=15, rootIndexSizeKB=16219, totalStaticIndexSizeKB=246127, totalStaticBloomSizeKB=12936, memstoreSizeMB=1421, readRequestsCount=633241097, writeRequestsCount=9375846, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=3042, maxHeapMB=4591, blockCacheSizeMB=890.19, blockCacheFreeMB=257.65, blockCacheCount=14048, blockCacheHitCount=5854936149, blockCacheMissCount=14761288, blockCacheEvictedCount=4870523, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=29
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-21, 09:04
Hi,
I've checked my 30 RPC handlers, they are all in a WAITING state:
Thread 89 (PRI IPC Server handler 6 on 60020):
  State: WAITING
  Blocked count: 238
  Waited count: 617
  Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@131f139b
  Stack:
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
    java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
    java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
    org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1299)
Here is an extract for one of our RS (it is similar on all of them):
requestsPerSecond=593, numberOfOnlineRegions=584, numberOfStores=1147, numberOfStorefiles=1980, storefileIndexSizeMB=15, rootIndexSizeKB=16219, totalStaticIndexSizeKB=246127, totalStaticBloomSizeKB=12936, memstoreSizeMB=1421, readRequestsCount=633241097, writeRequestsCount=9375846, compactionQueueSize=0, flushQueueSize=0, usedHeapMB=3042, maxHeapMB=4591, blockCacheSizeMB=890.19, blockCacheFreeMB=257.65, blockCacheCount=14048, blockCacheHitCount=5854936149, blockCacheMissCount=14761288, blockCacheEvictedCount=4870523, blockCacheHitRatio=99%, blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=29
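A few ratios can be derived from the counters above, which may make the extract easier to read. This is a minimal sketch, not an HBase tool: parse_metrics is a hypothetical helper, and only a subset of the posted counters is copied into the sample string.

```python
def parse_metrics(line):
    """Split a 'key=value, key=value' metrics dump into a dict of strings."""
    pairs = (item.split("=", 1) for item in line.split(",") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


# Subset of the counters posted above.
metrics_line = (
    "requestsPerSecond=593, numberOfOnlineRegions=584, numberOfStores=1147, "
    "numberOfStorefiles=1980, usedHeapMB=3042, maxHeapMB=4591, "
    "blockCacheHitCount=5854936149, blockCacheMissCount=14761288"
)
m = parse_metrics(metrics_line)

hits = int(m["blockCacheHitCount"])
misses = int(m["blockCacheMissCount"])
hit_ratio = hits / (hits + misses)            # cumulative since server start
files_per_store = int(m["numberOfStorefiles"]) / int(m["numberOfStores"])
heap_used = int(m["usedHeapMB"]) / int(m["maxHeapMB"])

print(f"block cache hit ratio: {hit_ratio:.2%}")
print(f"store files per store: {files_per_store:.2f}")
print(f"heap in use:           {heap_used:.0%}")
```

Note that the hit and miss counts are cumulative, so the derived ratio describes the whole uptime of the server and may hide a recent drop.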
Any advice?
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Stack 2012-11-21, 17:39
On Wed, Nov 21, 2012 at 1:04 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I've checked my 30 RPC handlers, they are all in a WAITING state:
>
> Thread 89 (PRI IPC Server handler 6 on 60020):
>   State: WAITING
>   Blocked count: 238
>   Waited count: 617
>   Waiting on java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@131f139b
>   Stack:
>     sun.misc.Unsafe.park(Native Method)
>     java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
>     java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)
>     java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)
>     org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1299)
So Vincent, the servers are quiet? Which would match your low CPU observation. Clients are unable to send them load for some reason? How many disks? What is your block cache hit number (see regionserver log -- it gets printed every so often .... or in the below I see 99% so your numbers should be good coming out of the regionserver).
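The "servers are quiet" reading follows from the quoted stack: a handler parked in LinkedBlockingQueue.take() directly under HBaseServer$Handler.run() is waiting for a call to arrive, not serving one. A rough way to apply the same check to all 30 handlers in a dump could look like the sketch below. The function name and the heuristic are assumptions made for illustration, not an HBase API, and the "busy" frames are only an example of a handler serving a scanner next() call.

```python
QUEUE_TAKE = "java.util.concurrent.LinkedBlockingQueue.take("
HANDLER_RUN = "org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run("


def is_idle_handler(frames):
    """Heuristic: the handler is idle if run() is blocked directly in queue.take().

    `frames` lists the stack innermost frame first, as in the thread dump.
    """
    for inner, outer in zip(frames, frames[1:]):
        if inner.startswith(QUEUE_TAKE) and outer.startswith(HANDLER_RUN):
            return True
    return False


# Stack of "IPC Server handler 6" as posted in the thread.
handler_6 = [
    "sun.misc.Unsafe.park(Native Method)",
    "java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)",
    "java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1987)",
    "java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:399)",
    "org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1299)",
]

# Illustrative stack of a handler that is busy serving a scanner next() call.
busy_handler = [
    "org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2117)",
    "org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)",
    "org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)",
]

print("handler 6 idle:", is_idle_handler(handler_6))
print("busy example idle:", is_idle_handler(busy_handler))
```

If every handler matches the idle pattern while clients are timing out, the bottleneck is more likely in front of the handlers (client side, network, or request dispatch) than in request processing itself.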
> Here is some extract for one of our RS (this is similar to all of them):
>
> requestsPerSecond=593, numberOfOnlineRegions=584, numberOfStores=1147,
> numberOfStorefiles=1980, storefileIndexSizeMB=15, rootIndexSizeKB=16219,
> totalStaticIndexSizeKB=246127, totalStaticBloomSizeKB=12936,
> memstoreSizeMB=1421, readRequestsCount=633241097,
> writeRequestsCount=9375846, compactionQueueSize=0, flushQueueSize=0,
> usedHeapMB=3042, maxHeapMB=4591, blockCacheSizeMB=890.19,
> blockCacheFreeMB=257.65, blockCacheCount=14048,
> blockCacheHitCount=5854936149, blockCacheMissCount=14761288,
> blockCacheEvictedCount=4870523, blockCacheHitRatio=99%,
> blockCacheHitCachingRatio=99%, hdfsBlocksLocalityIndex=29
600 regions is a lot per server. You should put it on your TODO list to have fewer per server -- bigger regions, which you can do now that you are on 0.92.
If you major compact -- do it when the site is less heavily loaded -- does your performance go up?
Are all query types slow or just certain types?
St.Ack
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-21, 18:35
On 21/11/12 at 18:39, Stack wrote:
> So Vincent, the servers are quiet? Which would match your low CPU
> observation. Clients are unable to send them load for some reason?
> How many disks. What is your block cache hit number (see regionserver
> log -- it gets printed every so often .... or in the below I see 99%
> so your numbers should be good coming out of the regionserver).

It does not seem to be a load issue: as you say, CPU is low and RPC handlers are underused. We have plenty of disk space, and our block cache hit ratio is 99% on all region servers...

Today we tried to remove some region servers (yes: we had only 8 before moving to 0.92, and we added 8 more because we thought it was a performance issue). We now have 12 of them, and actually the performance is similar (just more CPU load of course, but similar response time).

> 600 regions is a lot per server. You should put it on your TODO list
> to have less per server -- bigger regions which you can do now you are
> on 0.92.

This is definitely on our TODO list. Nevertheless, our 8 RS (0.90.3) before the move had more than 1100 regions each, without any issue! We increased our region size by x4 (we now use the default 1GB setting). And we plan to merge some tables.

> If you major compact -- do it when site is less heavily loaded -- does
> our performance go up.
>
> Are all query types slow or just certain types?

Actually, things are OK for a time (say 2 to 4 ms response time), then we get "scanner lease" exceptions... We cannot figure out what triggers this exception (we thought it was contention somewhere, or a server slowdown, but our last investigation seems to point to a bug between server and clients).
Here is a typical set of exceptions we get from time to time:
Client (a Pig script using HBaseStorage):
2012-11-21 14:47:29,925 | ERROR | main | Launcher | Backend error message
org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '4537659031468873643' does not exist
    at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2117)
    at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server.call(WritableRpcEngine.java:364)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:96)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:84)
    at org.apache.hadoop.hbase.client.ScannerCallable.call(ScannerCallable.java:39)
    at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionServerWithRetries(HConnectionManager.java:1325)
    at org.apache.hadoop.hbase.client.HTable$ClientScanner.next(HTable.java:1293)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReaderImpl.nextKeyValue(TableRecordReaderImpl.java:133)
    at org.apache.hadoop.hbase.mapreduce.TableRecordReader.nextKeyValue(TableRecordReader.java:142)
    at org.apache.pig.backend.hadoop.hbase.HBaseTableInputFormat$HBaseTableRecordReader.nextKeyValue(HBaseTableInputFormat.java:162)
    at org.apache.pig.backend.hadoop.hbase.HBaseStorage.getNext(HBaseStorage.java:452)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader.nextKeyValue(PigRecordReader.java:194)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Region server:
2012-11-21 14:45:55,199 ERROR org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.regionserver.LeaseException: lease '4537659031468873643' does not exist
    at org.apache.hadoop.hbase.regionserver.Leases.removeLease(Leases.java:231)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.next(HRegionServer.java:2117)
    at sun.reflect.GeneratedMethodAccessor25.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:1326)
2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): {"processingtimems":63895,"call":"next(4537659031468873643, 512), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.124.45.132:19289","starttimems":1353509093424,"queuetimems":0,"class":"HRegionServer","responsesize":6,"method":"next"}
2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: IPC Server Responder, call next(45376
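Editor's note: the responseTooSlow entry reports a next() call that took 63,895 ms. A minimal sketch, assuming the scanner lease period (hbase.regionserver.lease.period) is at its 60,000 ms default, shows how to pull that figure out of the log line and compare it with the lease period. The thread mentions a "longer timeout" without naming the setting, so the 60,000 ms value is an assumption, and the helper names below are hypothetical. The log line is the entry from the region server log, with starttimems written as a single number.

```python
import json
import re

# Assumed value of hbase.regionserver.lease.period (milliseconds).
# The thread does not state the value actually configured on this cluster.
DEFAULT_LEASE_PERIOD_MS = 60_000


def parse_slow_response(line):
    """Return the JSON payload of a (responseTooSlow) log line as a dict, or None."""
    match = re.search(r"\(responseTooSlow\):\s*(\{.*\})", line)
    if match is None:
        return None
    return json.loads(match.group(1))


def exceeds_lease(payload, lease_period_ms=DEFAULT_LEASE_PERIOD_MS):
    """True when a scanner next() call ran longer than the lease period."""
    return payload["method"] == "next" and payload["processingtimems"] > lease_period_ms


line = (
    '2012-11-21 14:45:57,320 WARN org.apache.hadoop.ipc.HBaseServer: (responseTooSlow): '
    '{"processingtimems":63895,"call":"next(4537659031468873643, 512), rpc version=1, '
    'client version=29, methodsFingerPrint=54742778","client":"10.124.45.132:19289",'
    '"starttimems":1353509093424,"queuetimems":0,"class":"HRegionServer",'
    '"responsesize":6,"method":"next"}'
)

payload = parse_slow_response(line)
print(payload["processingtimems"], exceeds_lease(payload))  # prints: 63895 True
```

Under that assumption the call outlived the lease, which makes a lease expiry a plausible (not confirmed) explanation for the LeaseException. The call signature next(..., 512) suggests 512 rows are fetched per call, so lowering scanner caching or raising the lease period are the two settings worth testing.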
+
Vincent Barat 2012-11-21, 18:35
-
Re: X3 slow down after moving from HBase 0.90.3 to HBase 0.92.1
Vincent Barat 2012-11-21, 08:18
Yes, we put and get small amounts of data.
We already increased the handler count to 30; maybe it is not enough...
We will try this.

On 21/11/12 05:53, Alok Singh wrote:
> Do your PUTs and GETs have small amounts of data? If yes, then you can
> increase the number of handlers.
> We have an 8-node cluster on 0.92.1, and these are some of the settings
> we changed from 0.90.4
>
> hbase.regionserver.handler.count = 150
> hbase.hregion.max.filesize=2147483648 (2GB)
>
> The region servers are run with a 16GB heap (-Xmx16000M)
>
> With these settings, at peak we can handle ~2K concurrent clients.
>
> Alok
>
> On Tue, Nov 20, 2012 at 8:21 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> We have changed some parameters on our 16(!) region servers : 1GB more -Xmx,
>> more rpc handler (from 10 to 30) longer timeout, but nothing seems to
>> improve the response time:
>>
>> - Scans with HBase 0.92 are x3 SLOWER than with HBase 0.90.3
>> - A lot of simultaneous gets lead to a huge slow down of batch put & random
>> read response time
>>
>> ... despite the fact that our RS CPU load is really low (10%)
>>
>> Note: we have not (yet) activated MSlabs, nor direct read on HDFS.
>>
>> Any idea please ? I'm really stuck on that issue.
>>
>> Best regards,
>>
>> On 16/11/12 20:55, Vincent Barat wrote:
>>> Hi,
>>>
>>> Right now (and previously with 0.90.3) we were using the default value
>>> (10).
>>> We are trying right now to increase to 30 to see if it is better.
>>>
>>> Thanks for your concern
>>>
>>> On 16/11/12 18:13, Ted Yu wrote:
>>>> Vincent:
>>>> What's the value for hbase.regionserver.handler.count ?
>>>>
>>>> I assume you keep the same value as that from 0.90.3
>>>>
>>>> Thanks
>>>>
>>>> On Fri, Nov 16, 2012 at 8:14 AM, Vincent Barat <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> On 16/11/12 01:56, Stack wrote:
>>>>>
>>>>>> On Thu, Nov 15, 2012 at 5:21 AM, Guillaume Perrot <[EMAIL PROTECTED]> wrote:
>>>>>>
>>>>>>> It happens when several tables are being compacted and/or when there
>>>>>>> is several scanners running.
>>>>>>>
>>>>>> It happens for a particular region? Anything you can tell about the
>>>>>> server looking in your cluster monitoring? Is it running hot? What
>>>>>> do the hbase regionserver stats in UI say? Anything interesting about
>>>>>> compaction queues or requests?
>>>>>>
>>>>> Hi, thanks for your answer Stack. I will take the lead on that thread
>>>>> from now on.
>>>>>
>>>>> It does not happen on any particular region. Actually, things get
>>>>> better now since compactions have been performed on all tables and
>>>>> have been stopped.
>>>>>
>>>>> Nevertheless, we face a dramatic decrease of performances (especially on
>>>>> random gets) of the overall cluster:
>>>>>
>>>>> Despite the fact we double our number of region servers (from 8 to 16)
>>>>> and despite the fact that these region server CPU load are just about
>>>>> 10% to 30%, performances are really bad : very often a light increase of
>>>>> request lead to a clients locked on request, very long response time.
>>>>> It looks like a contention / deadlock somewhere in the HBase client
>>>>> and C code.
>>>>>
>>>>>> If you look at the thread dump all handlers are occupied serving
>>>>>> requests? These timedout requests couldn't get into the server?
>>>>>>
>>>>> We will investigate on that and report to you.
>>>>>
>>>>>>> Before the timeouts, we observe an increasing CPU load on a single
>>>>>>> region server and if we add region servers and wait for rebalancing,
>>>>>>> we always have the same region server causing problems like these:
>>>>>>>
>>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
>>>>>>> Server Responder, call
>>>>>>> multi(org.apache.hadoop.hbase.client.MultiAction@2c3da1aa), rpc
>>>>>>> version=1, client version=29, methodsFingerPrint=54742778 from
>>>>>>> <ip>:45334: output error
>>>>>>> 2012-11-14 20:47:08,443 WARN org.apache.hadoop.ipc.HBaseServer: IPC
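Editor's note: the handler counts mentioned in this thread (10, 30 and 150) can be put in perspective with a back-of-the-envelope bound. This is a minimal sketch, assuming each request occupies exactly one RPC handler thread for its whole service time; the 20 ms service time and the function name are hypothetical and were not measured on either cluster.

```python
def max_throughput_per_server(handler_count, mean_service_time_ms):
    """Upper bound on requests/second for one region server when every
    request holds one handler thread for mean_service_time_ms."""
    if handler_count <= 0 or mean_service_time_ms <= 0:
        raise ValueError("handler_count and mean_service_time_ms must be positive")
    return handler_count * 1000.0 / mean_service_time_ms


# Hypothetical mean time a request holds a handler (milliseconds).
HYPOTHETICAL_SERVICE_TIME_MS = 20.0

# Handler counts taken from the thread: default, Vincent's new value, Alok's value.
for handlers in (10, 30, 150):
    bound = max_throughput_per_server(handlers, HYPOTHETICAL_SERVICE_TIME_MS)
    print(handlers, bound)
```

If handlers spend most of their time waiting (on disk or HDFS, for instance) rather than computing, this bound can be reached while CPU load stays low, which would be consistent with the 10% to 30% CPU figures reported here. A thread dump showing all handlers busy, as Stack suggested, would be the way to confirm it.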