Make sure you put a limit on the New Generation size.  We have
"-XX:NewSize=500m -XX:MaxNewSize=500m" in the 3G version of [...].  You
can go larger than 500m, but try to keep it small (~1G).

Look for evidence of a stop-the-world Java garbage collection.

1) look for "gc" lines in the tablet server logs:

 $ grep gc logs/tserver*.debug.log

You should see one line every second for a busy server.  Big delays
between gc lines are good evidence of a stop-the-world gc.
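
To make the gaps easier to spot than by eyeballing grep output, here is a
minimal sketch that reports delays between consecutive gc lines.  It
assumes each line carries a log4j-style timestamp of the form
"2014-04-14 14:59:01,123"; that format, the function name find_gc_gaps,
and the 5-second threshold are my assumptions, so adjust them to match
your logs.

```python
import re
from datetime import datetime

# Assumed timestamp format: "YYYY-MM-DD HH:MM:SS,mmm" somewhere in the line.
TIMESTAMP = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})")


def find_gc_gaps(lines, threshold_secs=5.0):
    """Return (previous, current, gap_seconds) for each gap above the threshold."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        # Skip lines without a timestamp or without "gc" in them.
        if match is None or "gc" not in line:
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        current = current.replace(microsecond=int(match.group(2)) * 1000)
        if previous is not None:
            gap = (current - previous).total_seconds()
            if gap > threshold_secs:
                gaps.append((previous, current, gap))
        previous = current
    return gaps
```

Feed it the lines of one log file at a time (for example, the result of
open(path) on a single tserver debug log), since lines from different
files are not in time order.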

2) big leaps in gc

If the GC lines show several gigabytes recovered in a single
collection, treat that as a warning sign: I have seen it happen just
before ZooKeeper disconnects.  You can reduce
-XX:CMSInitiatingOccupancyFraction=75 to something even smaller.  I'm
just guessing about this one... I suspect the OS is putting us in swap.
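
A rough way to look for those leaps, as a sketch only: scan the gc lines
for a large jump in free memory.  I am assuming the lines contain a field
like "freemem=4,000,000,000(+3,221,225,472)", with the value in bytes and
the number in parentheses being the change since the previous line.
Check that against your own logs first; the function name and the 2 GiB
threshold are made up for illustration.

```python
import re

# Assumed field: freemem=<bytes with commas>(<signed change with commas>)
FREEMEM = re.compile(r"freemem=([\d,]+)\(([+-][\d,]+)\)")


def find_big_recoveries(lines, threshold_bytes=2 * 1024**3):
    """Return (bytes_recovered, line) for each line freeing at least the threshold."""
    hits = []
    for line in lines:
        match = FREEMEM.search(line)
        if match is None:
            continue
        # Strip the thousands separators; int() accepts the leading sign.
        delta = int(match.group(2).replace(",", ""))
        if delta >= threshold_bytes:
            hits.append((delta, line.rstrip("\n")))
    return hits
```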

3) Swap... maybe?

I know you have no swap, but the OS might still be slow to give you
the pages you want if they have gone unused.  Maybe this isn't
possible: my low-level understanding of how the page cache works is
almost non-existent.  I've seen large, mostly idle tservers lose their
locks while doing gc.  This does not happen if we flush OS buffers
periodically, ensuring that free RAM is plentiful.  Of course, this
hurts file system performance.
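
If you want to check how much RAM is actually free on a tserver host,
here is a small sketch that parses the text of /proc/meminfo on Linux.
It only reads numbers; it does not flush anything.  The function names
are mine, and what counts as "plentiful" is for you to decide.

```python
import re

# /proc/meminfo lines look like "MemFree:         4000000 kB".
MEMINFO_LINE = re.compile(r"^([^:]+):\s+(\d+)", re.MULTILINE)


def parse_meminfo(text):
    """Map each /proc/meminfo field name to its integer value (mostly kB)."""
    return {name: int(value) for name, value in MEMINFO_LINE.findall(text)}


def free_ram_fraction(meminfo):
    """Fraction of total RAM that is completely free (page cache not counted)."""
    return meminfo["MemFree"] / meminfo["MemTotal"]
```

For example, parse_meminfo(open("/proc/meminfo").read()) gives you the
dictionary; a non-zero "SwapTotal" would also tell you that swap is
configured after all.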

4) make sure you are using the native map.


On Mon, Apr 14, 2014 at 2:59 PM, Frans Lawaetz <[EMAIL PROTECTED]> wrote:
