HBase, mail # user - Struggling with Region Servers Running out of Memory


Jeff Whiting 2012-10-29, 22:55
Stack 2012-10-31, 05:39
Jeff Whiting 2012-11-01, 17:01
Jeremy Carroll 2012-11-01, 17:07
Jeff Whiting 2012-11-01, 19:25
Jeff Whiting 2012-11-01, 23:44
Jeff Whiting 2012-11-02, 00:53
Jean-Daniel Cryans 2012-11-05, 19:50
Jeff Whiting 2012-11-02, 00:44
ramkrishna vasudevan 2012-10-30, 06:43
Jeff Whiting 2012-10-31, 00:21
Jeff Whiting 2012-10-31, 00:40
Re: Struggling with Region Servers Running out of Memory
ramkrishna vasudevan 2012-10-31, 04:45
Are you writing fat cells?

Did you try raising the heap size and seeing if it still crashes?

Regards
Ram
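
[Editor's note: for readers following along, the heap size Ram refers to is normally raised in conf/hbase-env.sh. A minimal sketch, assuming a CDH4-era setup; the 12000 MB figure and the log path are illustrative values, not recommendations:]

```shell
# conf/hbase-env.sh -- illustrative values, tune for your hardware.
# Maximum heap (in MB) handed to each HBase daemon's JVM.
export HBASE_HEAPSIZE=12000

# Turn on GC logging so long pauses (like the ones quoted below in
# this thread) can be confirmed in the GC log rather than guessed at.
export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails \
  -XX:+PrintGCTimeStamps -Xloggc:/var/log/hbase/gc-hbase.log"
```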

On Wed, Oct 31, 2012 at 6:10 AM, Jeff Whiting <[EMAIL PROTECTED]> wrote:

> So I'm looking at ganglia, so the numbers are somewhat approximate (this is
> for a server that just crashed about a half hour ago due to running out of
> memory):
>
> Store files are hovering just below 1k.  Over the last 24 hours it has
> varied by about 100 files (I'm looking at hbase.regionserver.storefiles).
>
> Block cache count is about 24k varied by about 2k.  Our block cache free
> goes between 0.7G and 0.4G.  It looks like we have almost 3G free after
> restarting a region server.
>
> The evicted block count went from 210k to 320k over a 24 hour period.  Hit
> ratio is close to 100 (the graph isn't very detailed, so I'm guessing it is
> around 98-99%).
>
> Block cache size stays at about 2GB.
>
> ~Jeff
>
>
>
> On 10/30/2012 6:21 PM, Jeff Whiting wrote:
>
>> We have no coprocessors.  We are running replication from this cluster to
>> another one.
>>
>> What is the best way to see how many store files we have? Or checking on
>> the block cache?
>>
>> ~Jeff
>>
>> On 10/30/2012 12:43 AM, ramkrishna vasudevan wrote:
>>
>>> Hi
>>>
>>> Are you using any coprocessors? Can you see how many store files are
>>> created?
>>>
>>> The number of blocks getting cached will give you an idea too.
>>>
>>> Regards
>>> Ram
>>>
>>> On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[EMAIL PROTECTED]>
>>> wrote:
>>>
>>>> We have 6 region servers, each given 10G of memory for HBase.  Each
>>>> region server has an average of about 100 regions, and across the
>>>> cluster we are averaging about 100 requests / second with a pretty even
>>>> read / write load.  We are running cdh4 (0.92.1-cdh4.0.1, rUnknown).
>>>>
>>>> Looking over our load and our requests, I feel that the 10GB of memory
>>>> should be enough to handle the load and that we shouldn't really be
>>>> pushing the memory limits.
>>>>
>>>> However, what we are seeing is that our memory usage goes up slowly
>>>> until the region server starts sputtering due to garbage collection
>>>> pauses; it will eventually get timed out by ZooKeeper and be killed.
>>>>
>>>> We'll see aborts like this in the log:
>>>> 2012-10-29 08:10:52,132 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>>>> Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
>>>> Server REPORT rejected; currently processing
>>>> ds5.h1.ut1.qprod.net,60020,1351233245547 as dead server
>>>> 2012-10-29 08:10:52,250 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>> 2012-10-29 08:10:52,392 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
>>>> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
>>>> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
>>>> received expired from ZooKeeper, aborting
>>>> 2012-10-29 08:10:52,401 FATAL
>>>> org.apache.hadoop.hbase.regionserver.HRegionServer:
>>>> RegionServer abort: loaded coprocessors are: []
>>>>
>>>> Which are "caused" by:
>>>> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
>>>> slept 29014ms instead of 3000ms, this is likely due to a long garbage
>>>> collecting pause and it's usually bad, see
>>>> http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>>>> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper:
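
[Editor's note: the Sleeper warning quoted above points at the usual mitigation besides shrinking GC pauses: giving the ZooKeeper session more slack. A hedged sketch of the standard hbase-site.xml knob; 60000 ms is only an illustrative value, and the effective timeout is also bounded by what the ZooKeeper ensemble's maxSessionTimeout allows:]

```xml
<!-- hbase-site.xml: raise the region server <-> ZooKeeper session
     timeout so a long GC pause is less likely to expire the session
     and get the server declared dead. Illustrative value only. -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>
```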
Jeff Whiting 2012-11-01, 15:14