HBase >> mail # user >> Struggling with Region Servers Running out of Memory


Re: Struggling with Region Servers Running out of Memory
Hi

Are you using any coprocessors? Can you see how many store files are
created?

The number of blocks getting cached will give you an idea too.

Regards
Ram
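
To put the block cache question in context, here is a rough sketch of how a 10G heap might be divided. It assumes the 10G figure in the quoted message is per region server. The two fractions are assumptions standing in for `hfile.block.cache.size` and `hbase.regionserver.global.memstore.upperLimit`; they are not read from Jeff's cluster, so check hbase-site.xml for the real values.

```python
def heap_budget(heap_gb, regions, block_cache_fraction=0.25, memstore_fraction=0.4):
    """Rough split of a region server heap.

    block_cache_fraction and memstore_fraction are assumed values,
    not settings taken from the cluster discussed in this thread.
    """
    block_cache_gb = heap_gb * block_cache_fraction
    memstore_gb = heap_gb * memstore_fraction
    # Whatever is left has to cover RPC buffers, compactions, indexes, etc.
    other_gb = heap_gb - block_cache_gb - memstore_gb
    return {
        "block_cache_gb": block_cache_gb,
        "memstore_gb": memstore_gb,
        "other_gb": other_gb,
        # Average memstore room per region if all regions take writes evenly.
        "memstore_mb_per_region": memstore_gb * 1024 / regions,
    }


budget = heap_budget(heap_gb=10, regions=100)
for name, value in budget.items():
    print(f"{name}: {value:.2f}")
```

If the cached block count reported by the region server is far above or below what the block cache share allows, that is a hint about where the heap is actually going.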

On Tue, Oct 30, 2012 at 4:25 AM, Jeff Whiting <[EMAIL PROTECTED]> wrote:

> We have 6 region servers given 10G of memory for HBase.  Each region server
> has an average of about 100 regions, and across the cluster we are averaging
> about 100 requests / second with a pretty even read / write load.  We are
> running cdh4 (0.92.1-cdh4.0.1, rUnknown).
>
> I feel, looking over our load and our requests, that the 10GB of memory
> should be enough to handle the load and that we shouldn't really be pushing
> the memory limits.
>
> However, what we are seeing is that our memory usage goes up slowly until
> the region server starts sputtering due to GC issues, and it will
> eventually get timed out by ZooKeeper and be killed.
>
> We'll see aborts like this in the log:
> 2012-10-29 08:10:52,132 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> Unhandled exception: org.apache.hadoop.hbase.YouAreDeadException:
> Server REPORT rejected; currently processing ds5.h1.ut1.qprod.net,60020,1351233245547
> as dead server
> 2012-10-29 08:10:52,250 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
> 2012-10-29 08:10:52,392 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> ABORTING region server ds5.h1.ut1.qprod.net,60020,1351233245547:
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf
> regionserver:60020-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf-0x13959edd45934cf received
> expired from ZooKeeper, aborting
> 2012-10-29 08:10:52,401 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer:
> RegionServer abort: loaded coprocessors are: []
>
> Which are "caused" by:
> 2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 29014ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 28121ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:13,261 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 31124ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:09:45,536 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32209ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:18,103 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 32557ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> 2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: We
> slept 33741ms instead of 3000ms, this is likely due to a long garbage
> collecting pause and it's usually bad, see http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
>
>
> We'll also see a bunch of responseTooSlow and operationTooSlow as GC kicks
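
To see how the pauses trend over time, the Sleeper warnings quoted above can be pulled out of the region server log with a short script. This is a minimal sketch: it assumes each warning sits on a single line in the real log file (the quoted copies are wrapped by the mail client), and `long_pauses` and its `threshold_ms` parameter are hypothetical names introduced here, not HBase settings.

```python
import re

# Matches the Sleeper warning format shown in the quoted log excerpt.
SLEEPER_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) WARN .*Sleeper: "
    r"We slept (?P<slept>\d+)ms instead of (?P<expected>\d+)ms"
)


def long_pauses(lines, threshold_ms=10000):
    """Return (timestamp, slept_ms, overshoot_ms) for pauses above threshold_ms.

    threshold_ms is an illustrative cutoff, not the ZooKeeper session timeout.
    """
    found = []
    for line in lines:
        match = SLEEPER_RE.match(line)
        if not match:
            continue
        slept = int(match.group("slept"))
        expected = int(match.group("expected"))
        if slept >= threshold_ms:
            found.append((match.group("ts"), slept, slept - expected))
    return found


# Three of the warnings from the message, joined back onto single lines.
sample = [
    "2012-10-29 08:07:40,646 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 29014ms instead of 3000ms, this is likely due to a long garbage collecting pause",
    "2012-10-29 08:08:39,074 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 28121ms instead of 3000ms, this is likely due to a long garbage collecting pause",
    "2012-10-29 08:10:51,896 WARN org.apache.hadoop.hbase.util.Sleeper: "
    "We slept 33741ms instead of 3000ms, this is likely due to a long garbage collecting pause",
]

pauses = long_pauses(sample)
for ts, slept, overshoot in pauses:
    print(f"{ts}  slept {slept} ms  ({overshoot} ms over)")
```

Comparing the slept values against the configured ZooKeeper session timeout shows how close each pause came to getting the server expired.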