I took a look at our logs and I don't see anywhere that we hit that kind of
vicious cycle of gcs. The only thing I noted is that we have a very high
new ratio of 16 (I believe the default is 2 for AMD64... and I can't
remember now why we have it set that way). For reference, the
current settings that work for us with a 12gb heap:
-XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:NewRatio=16
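To put that new ratio in perspective, here is a rough sketch of the sizing arithmetic. It assumes the usual HotSpot meaning of NewRatio (old generation size divided by young generation size), so the young generation gets about heap / (NewRatio + 1). The class and method names are made up for illustration, and the real JVM rounds and aligns these sizes, so treat the numbers as approximate:

```java
public class NewRatioSketch {
    // NewRatio = old / young, so young = heap / (NewRatio + 1).
    // This ignores the JVM's own alignment and rounding of generation sizes.
    static long youngGenBytes(long heapBytes, int newRatio) {
        return heapBytes / (newRatio + 1);
    }

    public static void main(String[] args) {
        long heap = 12L * 1024 * 1024 * 1024; // the 12gb heap mentioned above
        // Compare the value I believe is the default (2) with our setting (16).
        for (int ratio : new int[] {2, 16}) {
            long youngMb = youngGenBytes(heap, ratio) / (1024 * 1024);
            System.out.println("NewRatio=" + ratio + " -> young gen of about " + youngMb + " MB");
        }
    }
}
```

So with our setting the young generation is only around 0.7gb of the 12gb heap, versus roughly 4gb at a ratio of 2.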
I'm not sure what more I can suggest. I wish I remembered more about this
than I do. The Sun/Oracle guys have put out a couple of presentations at
past JavaOnes that do a lot to describe gc tuning and particular symptoms.
I remember a recent one by Charlie Hunt that was super useful. Looked
something like this:
Note, we actually don't monitor request latency because the vast majority
of our work is batch based (and thus we're focused on throughput). We also
use very little block cache. We also have way too many regions (based on
benchmarks provided by others), averaging about 700/server. We also have
much larger object sizes if yours average only 200 bytes...
On Tue, Dec 6, 2011 at 6:30 PM, Derek Wollenstein <[EMAIL PROTECTED]> wrote:
> Jacques --
> The main problem we really see is that after "a while" (yes, this is
> unspecific) we start seeing "random" (e.g. uncorrelated with the key being
> retrieved, but correlated with time) timeouts (>500ms) retrieving small (<
> 200 byte) rows from hbase. The interesting thing is that even a test case
> that seems like it should be very cacheable (retrieving literally the same
> row, over and over again) will have the same responsiveness. I was just
> suspecting that if we're running 6 GC/second (which is what this hits at
> "peak GC"), at 25-35ms each, that could certainly explain a decent
> amount of unavailability.
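For what it's worth, the arithmetic on those numbers can be sketched as below. This assumes every one of those collections is a stop-the-world pause and that the pauses don't overlap, which may not hold for every CMS phase. The class and method names are just illustrative:

```java
public class GcPauseBudget {
    // Fraction of wall-clock time spent paused, assuming each collection
    // is a non-overlapping stop-the-world pause of the given length.
    static double pausedFraction(double collectionsPerSecond, double pauseMillis) {
        return collectionsPerSecond * pauseMillis / 1000.0;
    }

    public static void main(String[] args) {
        double rate = 6.0; // GCs per second at "peak GC", from the message above
        // 25ms and 35ms are the two ends of the quoted pause range.
        for (double pauseMs : new double[] {25.0, 35.0}) {
            long percent = Math.round(pausedFraction(rate, pauseMs) * 100);
            System.out.println(pauseMs + " ms pauses -> about " + percent + "% of each second paused");
        }
    }
}
```

That works out to roughly 150-210ms of pause in every second, or 15-21% of the time, under those assumptions.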
> 1) We don't yet have MSLAB enabled, primarily because we're trying to turn
> on settings changes one at a time. I also wasn't clear if these symptoms
> were consistent with the ones mslab was trying to fix, although I have
> considered it.
> 2) I don't think a four minute pause is good, but I don't think a 120
> second pause is that great either, so it's something we'd have to worry
> about.
> 3) Yes, we have 8 physical cores (16 HT cores), so that shouldn't be the
> 4) Off heap block cache does seem like a cool feature, I will keep it in
> mind.
> The main reason I was asking is that we really did see this sudden "jump"
> in GC activity over time, and I was hoping it indicated something like
> setting the block cache setting too high relative to the heap size...
> On Tue, Dec 6, 2011 at 6:14 PM, Jacques <[EMAIL PROTECTED]> wrote:
> > I'll start with clearly stating that I'm not a gc specialist. I spend a
> > bunch of time with it but forget all the things I learn once I solve my
> > problems...
> > What exactly is the problem here? Does the server become unresponsive
> > after 16 hours? What happens in the HBase logs for that regionserver? I
> > believe that you're seeing frequent runs likely because of fragmentation in
> > your heap along with your -XX:CMSInitiatingOccupancyFraction of 60%. This
> > would be a precursor to a full gc which would likely actually take the
> > server down.
> > A few quick thoughts that you may or may not have run across:
> > - MSLAB is your friend if you haven't been using it already. See more
> > here:
> > - I can't remember exactly but I feel like the number that used to be
> > quoted by some was 10 seconds per gb for a full gc. So you're looking
> > at a full gc of over ~4 minutes with that size heap once you do arrive
> > at a full gc.
> > - If you're okay having unresponsive regions for 4+ minutes, you'd also