A few things jump out at me on a cursory glance:
- You are using CMSIncrementalMode, which after a long chain of events has a tendency to result in the famous Juliet pause of death. Can you try ParNew GC instead and see if that helps?
- You should try to reduce the CMSInitiatingOccupancyFraction to avoid a full GC
- Your hbase-env.sh is not setting the Xmx at all. Do you know how much RAM you are giving to your region servers? It may be too small or too large given your use case and machine size
- Your client scanner caching is 1 which may be too small depending on your row sizes. You can also override that setting in your scan for the MR job
- You only have 2 ZooKeeper instances, which is not at all recommended. ZooKeeper needs a quorum (a majority of its servers) to operate and generally works best with an odd number of servers. With 2 servers, losing either one loses the quorum. This probably isn't related to your crashes, but it would help stability if you had 1 or 3 ZooKeeper servers
- I am not 100% sure if the version of HBase you are using has MSLAB enabled. If not, you should enable it.
- You can try increasing/decreasing the amount of RAM you provide to block caches and memstores to suit your use case. I see that you are using the defaults here
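
To make the ZooKeeper point concrete, here is a small standalone sketch of the majority arithmetic. The class and method names are made up for illustration and are not part of any ZooKeeper API; the only assumption is that a quorum is a strict majority of the ensemble.

```java
public class QuorumMath {
    // Smallest number of servers that forms a strict majority of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // Number of servers that can fail while a majority is still reachable.
    static int tolerableFailures(int ensembleSize) {
        return ensembleSize - majority(ensembleSize);
    }

    public static void main(String[] args) {
        // Print the figures for ensembles of 1 to 5 servers.
        for (int n = 1; n <= 5; n++) {
            System.out.println(n + " servers: majority=" + majority(n)
                    + ", tolerable failures=" + tolerableFailures(n));
        }
    }
}
```

With 2 servers the majority is 2, so no failure can be tolerated, the same as with 1 server but with twice as many machines that can fail. 3 servers tolerate 1 failure.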
On top of these, when you kick off your MR job to scan HBase, you should set setCacheBlocks to false on the scan so that a one-off full scan does not churn the block cache.
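
A rough sketch of what the scan setup for such a job could look like, in case it helps. The table name "my_table", the job name, the RowMapper class and the caching value of 500 are all placeholders I made up; pick a caching value that fits your row sizes. It needs the HBase and Hadoop jars on the classpath, and on older Hadoop versions you may need new Job(conf, name) instead of Job.getInstance.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

public class ScanJobSketch {
    // Hypothetical mapper: replace the body with your own per-row logic.
    public static class RowMapper extends TableMapper<NullWritable, NullWritable> {
        @Override
        protected void map(ImmutableBytesWritable row, Result value, Context context) {
            // Process one row here.
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = Job.getInstance(conf, "scan-sketch"); // placeholder job name
        job.setJarByClass(ScanJobSketch.class);

        Scan scan = new Scan();
        scan.setCaching(500);       // rows fetched per RPC; overrides the client default for this scan (500 is illustrative)
        scan.setCacheBlocks(false); // do not fill the region server block cache during the MR scan

        TableMapReduceUtil.initTableMapperJob(
                "my_table",         // placeholder table name
                scan,
                RowMapper.class,
                NullWritable.class, // mapper output key class
                NullWritable.class, // mapper output value class
                job);
        job.setOutputFormatClass(NullOutputFormat.class); // the sketch emits nothing
        job.setNumReduceTasks(0);                          // map-only job

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Setting the caching on the Scan object only affects this job, so you can leave the cluster-wide client setting alone while you experiment.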
From: Flavio Pompermaier <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]; Dhaval Shah <[EMAIL PROTECTED]>
Sent: Friday, 23 May 2014 3:16 AM
Subject: Re: HBase cluster design
The hardware specs are: 4 nodes, each with 48 GB RAM, 24 cores and 1 TB of disk
Attached my hbase config files.
On Fri, May 23, 2014 at 3:33 AM, Dhaval Shah <[EMAIL PROTECTED]> wrote:
Can you share your hbase-env.sh and hbase-site.xml? And hardware specs of your cluster?