|
Jeff Whiting
2012-10-29, 22:55
Stack
2012-10-31, 05:39
Jeff Whiting
2012-11-01, 17:01
Jeremy Carroll
2012-11-01, 17:07
Jeff Whiting
2012-11-01, 19:25
Jeff Whiting
2012-11-01, 23:44
Jeff Whiting
2012-11-02, 00:53
Jean-Daniel Cryans
2012-11-05, 19:50
Jeff Whiting
2012-11-02, 00:44
ramkrishna vasudevan
2012-10-30, 06:43
Jeff Whiting
2012-10-31, 00:21
Jeff Whiting
2012-10-31, 00:40
ramkrishna vasudevan
2012-10-31, 04:45
Jeff Whiting
2012-11-01, 15:14
|
-
Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-10-29, 22:55

We have 6 region servers, each given 10G of memory for HBase. Each region server has an average of about 100 regions, and across the cluster we are averaging about 100 requests / second with a pretty even read / write load. We are running cdh4 (0.92.1-cdh4.0.1, rUnknown).

Looking over our load and our requests, I feel that 10GB of memory should be enough to handle the load and that we shouldn't really be pushing the memory limits.

However, what we are seeing is that our memory usage goes up slowly until the region server starts sputtering due to GC issues, and it eventually gets timed out by ZooKeeper and killed.

We'll see aborts like this in the log:

2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547 as dead server
2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []
2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547: regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received expired from ZooKeeper, aborting
2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: []

Which are "caused" by:

2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 31124ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32209ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32557ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 33741ms instead of 3000ms, this is likely due to a long garbage collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks in and really kills the region server's performance.

We have the JVM metrics going out to Ganglia, and looking at jvm.RegionServer.metrics.memHeapUsedM you can see that heap usage goes up over time until the server eventually runs out of memory. I can also see in hmaster:60010/master-status that usedHeapMB just goes up, and I can make a pretty educated guess as to which server will go down next. It takes several days to a week of continuous running (after restarting a region server) before we have a potential problem.

Our next one to go will probably be ds6, and jmap -heap shows:

concurrent mark-sweep generation:
   capacity = 10398531584 (9916.8125MB)
   used     = 9036165000 (8617.558479309082MB)
   free     = 1362366584 (1299.254020690918MB)
   86.89847145248619% used

So we are using 86% of the 10GB heap allocated to the concurrent mark-sweep generation. Looking at the web interface for ds6, which shows the tasks and RPC calls a region server is running, I don't see any compactions or other background tasks happening. Nor are there any active RPC calls that are longer than 0 seconds (it seems to be handling the requests just fine).

At this point I feel somewhat lost as to how to debug the problem. I'm not sure what to do next to figure out what is going on. Any suggestions as to what to look for, or how to debug where the memory is being used? I can generate heap dumps via jmap (although it effectively kills the region server), but I don't really know what to look for to see where the memory is going. I also have JMX set up on each region server and can connect to it that way.

Thanks,
~Jeff

--
Jeff Whiting
Qualtrics Senior Software Engineer
[EMAIL PROTECTED]
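One way to quantify the pauses reported above is to pull the Sleeper warnings out of the region server log and compute how far each one overshot the expected sleep. The following is only a minimal sketch: the regular expression is written against the log format quoted in this message, the function name `parse_sleeper_pauses` is made up for illustration, and the sample lines are truncated copies of the ones above.

```python
import re

# Matches the Sleeper warning format quoted in the message above.
SLEEPER_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} WARN \S*Sleeper: "
    r"We slept (\d+)ms instead of (\d+)ms"
)


def parse_sleeper_pauses(lines):
    """Return (timestamp, overshoot_ms) for every Sleeper warning found."""
    pauses = []
    for line in lines:
        match = SLEEPER_RE.match(line)
        if match:
            timestamp, slept, expected = match.groups()
            pauses.append((timestamp, int(slept) - int(expected)))
    return pauses


# Sample lines copied from the log excerpt (text after the durations truncated).
sample = [
    "2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause",
    "2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause",
    "2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: "
    "RegionServer abort: loaded coprocessors are: []",
]

for timestamp, overshoot_ms in parse_sleeper_pauses(sample):
    print("%s overshoot=%dms" % (timestamp, overshoot_ms))
```

Plotting the overshoot against time next to memHeapUsedM would show whether the pauses grow as the heap fills, which is what the excerpt suggests.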
-
Re: Struggling with Region Servers Running out of Memory
Stack 2012-10-31, 05:39

On Mon, Oct 29, 2012 at 3:55 PM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> However what we are seeing is that our memory usage goes up slowly until the
> region server starts sputtering due to gc collection issues and it will
> eventually get timed out by zookeeper and be killed.

Hey Jeff. You have GC logging enabled? It might not tell you more than you already know: that something is retaining more and more objects over time. You have a dumped heap? What have you used to poke at it? You generally want to find the objects that have the deepest size (not all profilers let you do this, though). This is usually enough to give you a clue.

Anything particular about the character of your load? Ram asks if there are any big cells in the mix?

St.Ack
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-01, 17:01

We don't have GC logging enabled (we did, but gc.log would begin filling up the hard drive and there was no way to clear it out without restarting the region server). Any way to enable gc.log and keep it to a reasonable size?

I have two separate jmap dumps of a region server taken before it died. I haven't really looked into those yet; I'll try to do that today. I've typically used the Eclipse Memory Analyzer Tool or NetBeans. Is there a profiler you'd recommend?

~Jeff
-
Re: Struggling with Region Servers Running out of Memory
Jeremy Carroll 2012-11-01, 17:07

Java 6 update 34 can rotate GC logs: -XX:+UseGCLogFileRotation

http://stackoverflow.com/questions/3822097/rolling-garbage-collector-logs-in-java

As for profiling memory dumps: JProfiler 7, YourKit, etc. YMMV.
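For reference, rotation is normally combined with a file count and a per-file size so that total disk usage stays roughly bounded. The sketch below just assembles such a flag set and computes that bound. The flags are standard HotSpot options, but the log path, the file count and the file size are placeholder assumptions, and `gc_log_options` is a hypothetical helper, not part of HBase.

```python
def gc_log_options(log_path, num_files, file_size_mb):
    """Build HotSpot GC-logging flags with log rotation enabled."""
    return [
        "-verbose:gc",
        "-XX:+PrintGCDetails",
        "-XX:+PrintGCDateStamps",
        "-Xloggc:" + log_path,
        "-XX:+UseGCLogFileRotation",
        "-XX:NumberOfGCLogFiles=%d" % num_files,
        "-XX:GCLogFileSize=%dM" % file_size_mb,
    ]


def max_gc_log_disk_mb(num_files, file_size_mb):
    """Approximate upper bound on disk used by the rotated GC logs."""
    return num_files * file_size_mb


# Placeholder values: 10 files of 20 MB each under an assumed log directory.
options = gc_log_options("/var/log/hbase/gc.log", 10, 20)
print('export HBASE_OPTS="$HBASE_OPTS %s"' % " ".join(options))
print("approximate disk bound: %d MB" % max_gc_log_disk_mb(10, 20))
```

The printed export line is the kind of entry that would go into hbase-env.sh; the JVM has to be at a version that supports rotation for the last three flags to be accepted.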
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-01, 19:25

Good to know. It's nice they finally got that in. We aren't on u36 right now in production, but I'm going to push on getting us there.

Thanks,
~Jeff

On 11/1/2012 11:07 AM, Jeremy Carroll wrote:
> Java 6 update 34 can rotate GC Logs. -XX:+UseGCLogFileRotation
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-01, 23:44

So this is some of what I'm seeing as I go through the profiles:

(a) 2GB - org.apache.hadoop.hbase.io.hfile.LruBlockCache
    This looks like it is the block cache, and we aren't having any problems with that...

(b) 1.4GB - org.apache.hadoop.hbase.regionserver.HRegionServer -- java.util.concurrent.ConcurrentHashMap$Segment[]
    It looks like it belongs to the member variable "onlineRegions", which has a member variable "segments". I'm guessing this is the memstores that HBase is currently holding onto.

(c) 4.3GB -- java.util.concurrent.LinkedBlockingQueue -- org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server
    This is the one that keeps growing and never shrinks, and it is what is causing us to run out of memory. However, the cause isn't immediately clear in MAT like it is for the other two. These seem to be the references to the LinkedBlockingQueue (you'll need a wide monitor to read it well):

    Class Name | Shallow Heap | Retained Heap
    ------------------------------------------------------------
    java.util.concurrent.LinkedBlockingQueue @ 0x2aaab3583d30 | 80 | 4,616,431,568
    |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3c70 REPL IPC Server handler 2 on 60020 Thread | 192 | 384,392
    |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3d30 REPL IPC Server handler 1 on 60020 Thread | 192 | 384,392
    |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3df0 REPL IPC Server handler 0 on 60020 Thread | 192 | 205,976
    |- replicationQueue org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x2aaab392dbe0 | 240 | 3,968
    ------------------------------------------------------------

    So it looks like it is shared between the myCallQueue and the replicationQueue. JProfiler is showing the same thing. I'm having a hard time figuring out much more.

(d) 977MB -- In other (no common root)
    This just seems to be other stuff going on in the region server, and I'm not really concerned about it, as I don't think it is the culprit.

Overall it looks like it has to do with replication. This cluster is in the middle of a replication chain A -> B -> C, where this cluster is B. So can we tell whether it is running out of memory because it is being replicated to, or because it is trying to replicate somewhere else?

Thanks,
~Jeff
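As a quick sanity check on the numbers in this breakdown, the retained size MAT reports for the queue can be converted to GiB and compared against the sum of the four categories. This is only arithmetic on the figures quoted in the message; the dictionary keys are labels chosen here for readability.

```python
GIB = 1024 ** 3

# Sizes as reported in the breakdown above, in GB.
breakdown_gb = {
    "LruBlockCache (a)": 2.0,
    "onlineRegions (b)": 1.4,
    "LinkedBlockingQueue (c)": 4.3,
    "other (d)": 0.977,
}

# Retained heap of the LinkedBlockingQueue, in bytes, from the MAT table.
queue_retained_bytes = 4616431568


def share_of_total(breakdown, key):
    """Fraction of the summed breakdown taken up by one entry."""
    return breakdown[key] / sum(breakdown.values())


queue_gib = queue_retained_bytes / GIB
queue_share = share_of_total(breakdown_gb, "LinkedBlockingQueue (c)")

print("queue retained: %.2f GiB" % queue_gib)
print("queue share of the four categories: %.1f%%" % (queue_share * 100))
```

On these figures the queue alone accounts for roughly half of everything the dump attributes to the four categories, which is consistent with it being the structure that grows.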
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-02, 00:53

Also, can any of the other call queues just fill up forever and cause an OOME as well? I don't see any code that limits the queue size based on the amount of memory the queued calls are using, so it seems like any of them (priorityCallQueue, replicationQueue or callQueue, which are all in HBaseServer) could suffer from the same problem I'm seeing in replicationQueue. Some thought may need to be put into how those queues are handled, how big we allow them to get, and when to block / drop the call rather than run out of memory.

It does look like the queue has a maximum number of items, as defined by ipc.server.max.callqueue.size.

~Jeff
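To make the distinction concrete: a limit on the number of queued items says nothing about memory when individual items can be large, whereas a limit on queued bytes does. The class below is a minimal sketch of a byte-bounded queue. It is not HBase code, and `ByteBoundedQueue` and its methods are hypothetical names used only for illustration.

```python
import collections
import threading


class ByteBoundedQueue:
    """FIFO queue that makes producers wait once queued payload bytes hit a cap."""

    def __init__(self, max_bytes):
        self._max_bytes = max_bytes
        self._queued_bytes = 0
        self._items = collections.deque()
        self._cond = threading.Condition()

    def put(self, payload, timeout=None):
        """Enqueue payload; return False if space did not free up in time."""
        size = len(payload)
        with self._cond:
            # An empty queue always admits one payload, so that an item
            # larger than the cap cannot block forever.
            def fits():
                return not self._items or self._queued_bytes + size <= self._max_bytes

            if not self._cond.wait_for(fits, timeout):
                return False
            self._items.append(payload)
            self._queued_bytes += size
            self._cond.notify_all()
            return True

    def get(self, timeout=None):
        """Dequeue the oldest payload, or return None if none arrived in time."""
        with self._cond:
            if not self._cond.wait_for(lambda: len(self._items) > 0, timeout):
                return None
            payload = self._items.popleft()
            self._queued_bytes -= len(payload)
            self._cond.notify_all()
            return payload

    @property
    def queued_bytes(self):
        with self._cond:
            return self._queued_bytes


queue = ByteBoundedQueue(max_bytes=10)
print(queue.put(b"123456"))              # fits: 6 of 10 bytes used
print(queue.put(b"7890123", timeout=0))  # refused: 6 + 7 exceeds the cap
print(queue.get())                       # frees 6 bytes
print(queue.put(b"7890123", timeout=0))  # now accepted
```

Whether a full queue should block the caller, as here, or drop the call is the design question raised in the message; the sketch picks blocking with an optional timeout so that the caller can choose.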
-
Re: Struggling with Region Servers Running out of Memory
Jean-Daniel Cryans 2012-11-05, 19:50

Jiras like https://issues.apache.org/jira/browse/HBASE-5190 try to fix the call queues by basing their limits more on the size of the payload.

Regarding your issue, it sounds like replication is blocked: e.g. if the 3 IPC threads can't write and are waiting on some condition, then newer calls will just pile up until the call queue is full. Considering that the default max size for replication packets is 64MB, you wouldn't need that many of them to fill up your heap.

You need to find what's blocking replication; the RS log should be vocal about what's wrong.

You might also want to lower the size of the replication calls by changing replication.source.size.capacity on cluster A.

In 0.94.2 there are a number of tweaks to make the replication threads less blocking, so rest assured that it's getting better.

J-D

On Thu, Nov 1, 2012 at 5:53 PM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> Also can any of the other call queues just fill up for ever and cause OOME
> as well?
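The point about 64MB packets can be checked with a little arithmetic on figures already quoted in this thread. The sketch assumes the worst case, where every queued replication call is at the 64MB maximum; real calls will usually be smaller, so these counts are lower bounds on how many calls it takes.

```python
MB = 1024 * 1024


def packets_to_fill(total_bytes, packet_bytes):
    """Number of whole packets of packet_bytes that fit in total_bytes."""
    return total_bytes // packet_bytes


heap_capacity_bytes = 10398531584   # CMS generation capacity from jmap -heap
queue_retained_bytes = 4616431568   # retained by the LinkedBlockingQueue in MAT
max_packet_bytes = 64 * MB          # default max replication packet size per the message

print("packets to fill the heap:", packets_to_fill(heap_capacity_bytes, max_packet_bytes))
print("full-size packets in the queue:", packets_to_fill(queue_retained_bytes, max_packet_bytes))
```

Under that assumption, fewer than 70 queued calls would already account for the 4.3GB seen in the heap dump.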
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-02, 00:44

OK, so I'm looking through the code. It looks like HBaseServer.java will create a replicationQueue if hbase.regionserver.replication.handler.count > 0. We haven't changed that, so the default is 3. The replicationQueue is then shared with the handlers. Then in processData(byte[] buf), if the call is a replication call, it is put in the replicationQueue.

So when cluster A is replicating to cluster B and cluster B isn't keeping up, does the replicationQueue just fill up until the server runs out of memory? It seems like it should rate limit, or only send new edits once the old ones have executed. I'm a little hazy on when processData is called and how it fits into the whole replication pipeline. Since the region servers are just replaying WAL logs to do the replication, it seems like the memory footprint could be made very small.

~Jeff
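The scenario being asked about, a sender that outpaces the handlers draining the queue, can be illustrated with a tiny deterministic model. This is not a model of HBase's replication code. The rates are invented for illustration, and `simulate_backlog` is a hypothetical helper.

```python
def simulate_backlog(arrivals_per_tick, handled_per_tick, ticks, max_queue=None):
    """Return the queue depth after each tick.

    When max_queue is set, calls arriving at a full queue are refused
    (the sender would have to wait or retry), so depth never exceeds it.
    """
    depth = 0
    history = []
    for _ in range(ticks):
        depth += arrivals_per_tick
        if max_queue is not None:
            depth = min(depth, max_queue)
        depth = max(0, depth - handled_per_tick)
        history.append(depth)
    return history


# Invented rates: 5 calls arrive per tick, the handlers finish 3 per tick.
print(simulate_backlog(5, 3, 4))               # unbounded: grows every tick
print(simulate_backlog(5, 3, 4, max_queue=6))  # bounded: levels off
print(simulate_backlog(3, 5, 3))               # handlers keep up: stays empty
```

Without a bound the backlog grows linearly for as long as arrivals exceed the handling rate, which matches the slow, steady heap growth described earlier in the thread; with a bound the pressure moves back to the sender instead.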
-
Re: Struggling with Region Servers Running out of Memoryramkrishna vasudevan 2012-10-30, 06:43
Hi,

Are you using any coprocessors? Can you see how many store files are being created? The number of blocks getting cached will give you an idea too.

Regards,
Ram

On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> We have 6 region server given 10G of memory for hbase. Each region server
> has an average of about 100 regions and across the cluster we are averaging
> about 100 requests / second with a pretty even read / write load. We are
> running cdh4 (0.92.1-cdh4.0.1, rUnknown)
>
> I feel that looking over our load and our requests that the 10GB of memory
> should be enough to handle the load and that we shouldn't really be pushing
> the memory limits.
>
> However what we are seeing is that our memory usage goes up slowly until
> the region server starts sputtering due to gc collection issues and it will
> eventually get timed out by zookeeper and be killed.
>
> [...]
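The Sleeper warnings in the original report already contain the length of each stall, so they can be pulled out of the log mechanically. The following is a minimal sketch, not an HBase tool: the function name `sleeper_pauses` is made up for illustration, and the log lines are the six warnings from the report, trimmed to the part the pattern needs.

```python
import re

# The six Sleeper warnings from the report, trimmed to the fields we parse.
LOG = """\
2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 29014ms instead of 3000ms
2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 28121ms instead of 3000ms
2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 31124ms instead of 3000ms
2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32209ms instead of 3000ms
2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 32557ms instead of 3000ms
2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We slept 33741ms instead of 3000ms
"""

# Timestamp (date + time), then the actual and the expected sleep in ms.
PATTERN = re.compile(
    r"^(\S+ \S+) WARN \S+Sleeper: We slept (\d+)ms instead of (\d+)ms",
    re.MULTILINE,
)


def sleeper_pauses(text):
    """Return (timestamp, overshoot_ms) for every Sleeper warning in text.

    overshoot_ms is how much longer the thread slept than it asked to,
    which is a rough proxy for the length of the stall.
    """
    return [
        (timestamp, int(slept) - int(expected))
        for timestamp, slept, expected in PATTERN.findall(text)
    ]


pauses = sleeper_pauses(LOG)
worst = max(overshoot for _, overshoot in pauses)
print(f"{len(pauses)} warnings, worst overshoot {worst} ms")
```

Every one of the six stalls is over 25 seconds and they arrive back to back, which fits the description of memory use creeping up until the collector can no longer keep pace.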
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-10-31, 00:21
We have no coprocessors. We are running replication from this cluster to another one.

What is the best way to see how many store files we have? Or to check on the block cache?

~Jeff

On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
> Hi
>
> Are you using any coprocessors? Can you see how many store files are
> created?
>
> The no of blocks getting cached will give you an idea too..
>
> Regards
> Ram
>
> [...]

Jeff Whiting
Qualtrics Senior Software Engineer
[EMAIL PROTECTED]
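One way to answer the store file and block cache question without a graphing system is to read the region server's metrics as text and pick out the counters of interest. The sketch below only shows the parsing step and rests on assumptions: it presumes the metrics are available as a comma-separated `key=value` string, the helper `parse_metrics` is hypothetical, and the sample string with its key names and values is made up for illustration rather than copied from real HBase output.

```python
def parse_metrics(line):
    """Split 'k1=v1, k2=v2' into a dict, converting numeric values to float."""
    metrics = {}
    for pair in line.split(","):
        key, sep, value = pair.strip().partition("=")
        if not sep:
            continue  # skip fragments that are not key=value pairs
        try:
            metrics[key] = float(value)
        except ValueError:
            metrics[key] = value  # keep non-numeric values as strings
    return metrics


# Made-up sample; key names and numbers are placeholders, not real output.
SAMPLE = "regions=100, storefiles=980, blockCacheCount=24000, blockCacheHitRatio=98"

m = parse_metrics(SAMPLE)
print(int(m["storefiles"]), int(m["blockCacheCount"]))
```

Sampling the same counters every few minutes and comparing them with heap usage would show whether store files or cached blocks grow along with memory.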
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-10-31, 00:40
I'm looking at Ganglia, so the numbers are somewhat approximate (this is for a server that crashed about half an hour ago due to running out of memory):

Store files are hovering just below 1k. Over the last 24 hours the count has varied by about 100 files (I'm looking at hbase.regionserver.storefiles).

Block cache count is about 24k, varying by about 2k. Our block cache free goes between 0.7G and 0.4G. It looks like we have almost 3G free after restarting a region server.

The evicted block count went from 210k to 320k over a 24 hour period. Hit ratio is close to 100 (the graph isn't very detailed, so I'm guessing it is around 98-99%).

Block cache size stays at about 2GB.

~Jeff

On 10/30/2012 6:21 PM, Jeff Whiting wrote:
> We have no coprocessors. We are running replication from this cluster to another one.
>
> What is the best way to see how many store files we have? Or checking on the block cache?
>
> ~Jeff
>
> [...]

Jeff Whiting
Qualtrics Senior Software Engineer
[EMAIL PROTECTED]
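The Ganglia figures above can be turned into two quick numbers: how fast blocks are being evicted, and how large the block cache is relative to the heap. This is a back-of-the-envelope sketch using only the values quoted in the message (210k to 320k evictions in 24 hours, about 2GB used, 0.4G to 0.7G free, 10G heap); the function names are made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def eviction_rate(start_count, end_count, seconds=SECONDS_PER_DAY):
    """Average evictions per second between two counter readings."""
    return (end_count - start_count) / seconds


def cache_capacity_gb(size_gb, free_gb):
    """Approximate block cache capacity as used size plus free space."""
    return size_gb + free_gb


HEAP_GB = 10  # heap given to each region server, per the original report

rate = eviction_rate(210_000, 320_000)
low = cache_capacity_gb(2.0, 0.4)
high = cache_capacity_gb(2.0, 0.7)

print(f"{rate:.2f} evictions/s")
print(f"cache capacity {low:.1f}-{high:.1f} GB, "
      f"{low / HEAP_GB:.0%}-{high / HEAP_GB:.0%} of heap")
```

On these readings the block cache accounts for roughly a quarter of the heap and evicts a little over one block per second, so by itself it does not explain a 10G heap filling up.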
-
Re: Struggling with Region Servers Running out of Memory
ramkrishna vasudevan 2012-10-31, 04:45
Are you writing fat cells?

Did you try raising the heap size to see if it still crashes?

Regards,
Ram

On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> So I'm looking at ganglia so the numbers are somewhat approximate (this is
> for a server that just crashed about an 1/2 hour ago due to running out of
> memory):
>
> Store files are hovering just below 1k. Over the last 24 hours it has
> varied by about 100 files (I'm looking at hbase.regionserver.storefiles).
>
> Block cache count is about 24k varied by about 2k. Our block cache free
> goes between 0.7G and 0.4G. It looks like we have almost 3G free after
> restarting a region server.
>
> The evicted block count went from 210k to 320k over a 24 hour period. Hit
> ratio is close to 100 (the graph isn't very detailed so I'm guess it is
> like 98-99%).
>
> Block cache size stays at about 2GB.
>
> ~Jeff
>
> [...]
-
Re: Struggling with Region Servers Running out of Memory
Jeff Whiting 2012-11-01, 15:14
No fat rows. We have kept the default HBase client limit of 10 MB, and most values are quite small (< 5k).

We haven't tried raising the memory limit, but we can raise it on one of the servers and see how it does. Looking at the graphs I don't think it will help, but it is worth a try.

~Jeff

On 10/30/2012 10:45 PM, ramkrishna vasudevan wrote:
> Are you writing fat cells?
>
> Did you try raising the heap size? and see if still it is crashing?
>
> Regards
> Ram
>
> [...]

Jeff Whiting
Qualtrics Senior Software Engineer
[EMAIL PROTECTED]
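The figures reported in this thread allow a rough upper bound on how much new data each region server takes in. The sketch below is only an estimate under stated assumptions: each write request carries a single value at the 5k upper bound, requests are spread evenly over the 6 servers, half of the roughly 100 requests per second are writes, and key and per-cell overhead is ignored. The function name is made up for illustration.

```python
def write_inflow_kb_per_s(cluster_rps, write_fraction, servers, value_kb):
    """Upper-bound write volume per region server in KB/s.

    Assumes one value per write request and an even spread across servers.
    """
    writes_per_server = cluster_rps * write_fraction / servers
    return writes_per_server * value_kb


# Figures from the thread: ~100 req/s cluster-wide, even read/write mix,
# 6 region servers, values below 5k.
inflow = write_inflow_kb_per_s(cluster_rps=100, write_fraction=0.5,
                               servers=6, value_kb=5)
per_hour_mb = inflow * 3600 / 1024

print(f"{inflow:.1f} KB/s per server, about {per_hour_mb:.0f} MB per hour")
```

Even at this upper bound the write volume is small next to a 10G heap, which is consistent with the expectation that the configured memory should be enough. The estimate says nothing about where the heap actually goes; a heap histogram or GC log would be needed for that.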