HBase >> mail # user >> Struggling with Region Servers Running out of Memory

Re: Struggling with Region Servers Running out of Memory
JIRAs like https://issues.apache.org/jira/browse/HBASE-5190 try to fix
the call queues by bounding them on the size of the payload rather than
the number of calls.

Regarding your issue, it sounds like replication is blocked, e.g. if the
3 IPC threads can't write and are waiting on some condition, then newer
calls will just pile up until the call queue is full. Considering that
the default max size for replication batches is 64MB, you wouldn't
need many of them to fill up your heap.
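To illustrate the idea behind that JIRA (a sketch under my own naming, not HBase's actual code): bounding a call queue by total payload bytes instead of entry count keeps the queue's memory footprint capped no matter how large individual calls are. The class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of a byte-bounded call queue: rejects new calls once the total
// queued payload exceeds a byte budget, instead of capping entry count.
public class PayloadBoundedQueue {
    private final Queue<byte[]> calls = new ArrayDeque<>();
    private final long maxBytes;
    private long queuedBytes = 0;

    public PayloadBoundedQueue(long maxBytes) {
        this.maxBytes = maxBytes;
    }

    /** Rejects the call when the byte budget would be exceeded. */
    public synchronized boolean offer(byte[] payload) {
        if (queuedBytes + payload.length > maxBytes) {
            return false; // caller must back off; heap usage stays bounded
        }
        calls.add(payload);
        queuedBytes += payload.length;
        return true;
    }

    public synchronized byte[] poll() {
        byte[] p = calls.poll();
        if (p != null) queuedBytes -= p.length;
        return p;
    }

    public synchronized long queuedBytes() {
        return queuedBytes;
    }

    public static void main(String[] args) {
        // With 64 MB replication batches, few queued calls fill a big heap:
        long batch = 64L * 1024 * 1024;
        long fourGb = 4L * 1024 * 1024 * 1024;
        System.out.println("batches to fill 4 GB: " + (fourGb / batch)); // 64

        // A 128 KB budget admits one 100 KB call but rejects the second.
        PayloadBoundedQueue q = new PayloadBoundedQueue(128 * 1024);
        System.out.println(q.offer(new byte[100 * 1024])); // true
        System.out.println(q.offer(new byte[100 * 1024])); // false
    }
}
```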

You need to find what's blocking replication; the RS log should be
vocal about what's wrong.

You might also want to lower the size of the replication calls by
changing replication.source.size.capacity on cluster A.
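For example, assuming the 64MB default mentioned above, you could set it smaller in hbase-site.xml on cluster A's region servers; 16MB here is only an illustrative value, not a recommendation:

```xml
<!-- hbase-site.xml on cluster A (the replication source) -->
<property>
  <name>replication.source.size.capacity</name>
  <!-- max bytes of WAL edits shipped per replication batch; 16 MB -->
  <value>16777216</value>
</property>
```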

In 0.94.2 there are a number of tweaks that make the replication threads
less prone to blocking, so rest assured that it's getting better.


On Thu, Nov 1, 2012 at 5:53 PM, Jeff Whiting <[EMAIL PROTECTED]> wrote:
> Also, can any of the other call queues just fill up forever and cause OOME
> as well?  I don't see any code that limits the queue size based on the
> amount of memory they are using, so it seems like any of them
> (priorityCallQueue, the replicationQueue, or the callQueue, which are all in
> the HBaseServer) could suffer from the same problem I'm seeing in the
> replicationQueue.  Some thought may need to be put into how those queues are
> handled, how big we allow them to get, and when to block or drop the call
> rather than run out of memory.  It does look like the queue has a max
> number of items in it, as defined by ipc.server.max.callqueue.size.
> ~Jeff
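To put rough numbers on that concern (the count cap here is a hypothetical value, not measured from this cluster): a limit on the number of queued items still lets the queue's heap footprint scale with per-call payload size.

```java
// Back-of-the-envelope: a count-only cap does not bound queue memory.
public class CallQueueMath {
    public static void main(String[] args) {
        int maxQueuedCalls = 100;              // hypothetical count cap
        long bytesPerCall = 64L * 1024 * 1024; // 64 MB replication batch
        long worstCaseBytes = maxQueuedCalls * bytesPerCall;
        // Worst-case heap held by queued calls alone:
        System.out.println(worstCaseBytes / (1024L * 1024 * 1024) + " GB"); // 6 GB
    }
}
```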
> On 11/1/2012 5:44 PM, Jeff Whiting wrote:
>> So this is some of what I'm seeing as I go through the profiles:
>> (a) 2GB - org.apache.hadoop.hbase.io.hfile.LruBlockCache
>>     This looks like it is the block cache and we aren't having any
>> problems with that...
>> (b) 1.4GB - org.apache.hadoop.hbase.regionserver.HRegionServer --
>> java.util.concurrent.ConcurrentHashMap$Segment[]
>>     It looks like it belongs to the member variable "onlineRegions",
>> which has a member variable "segments".
>>     I'm guessing these are the memstores that HBase is currently holding
>> onto.
>> (c) 4.3GB -- java.util.concurrent.LinkedBlockingQueue --
>> org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server
>>    This is the one that keeps growing without shrinking and is causing us
>> to run out of memory.  However, the cause isn't immediately clear like the
>> other two in MAT.
>>   These seem to be the references to the LinkedBlockingQueue (you'll need
>> a wide monitor to read it well):
>> Class Name                                                                                                                | Shallow Heap | Retained Heap
>> --------------------------------------------------------------------------------------------------------------------------------------------------------
>> java.util.concurrent.LinkedBlockingQueue @ 0x2aaab3583d30                                                                 |           80 | 4,616,431,568
>> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3c70 REPL IPC Server handler 2 on 60020 Thread |          192 |       384,392
>> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3d30 REPL IPC Server handler 1 on 60020 Thread |          192 |       384,392
>> |- myCallQueue org.apache.hadoop.hbase.ipc.HBaseServer$Handler @ 0x2aaab50d3df0 REPL IPC Server handler 0 on 60020 Thread |          192 |       205,976
>> |- replicationQueue org.apache.hadoop.hbase.ipc.WritableRpcEngine$Server @ 0x2aaab392dbe0                                 |          240 |         3,968
>> --------------------------------------------------------------------------------------------------------------------------------------------------------
>> So it looks like it is shared between the myCallQueue and the
>> replicationQueue.  JProfiler is showing the same thing.  I'm having a hard
>> time figuring out much more.
>> (d) 977MB -- In other (no common root)
>>     This just seems to be other stuff going on in the region server, but
>> I'm not really concerned about it, as I don't think it is the culprit.
>> Overall it looks like it has to do with replication.  So this cluster is