Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase, mail # dev - 转发: seem "SafepointTimeout" doesn't work well if sync takes too long ?


Copy link to this message
-
转发: seem "SafepointTimeout" doesn't work well if sync takes too long ?
谢良 2013-04-28, 11:29
FYI.
It would be pretty helpful if other folks hit the similar issue.

Best,
Liang
________________________________
发件人: Peter Levart [[EMAIL PROTECTED]]
发送时间: 2013年4月28日 17:50
收件人: 谢良
Cc: [EMAIL PROTECTED]
主题: Re: seem "SafepointTimeout" doesn't work well if sync takes too long ?

Hi Liang,

I think we had a similar issue. If "sync" or "sync cleanup" means cleanup of the inflated Object locks (monitors), then it might be the same issue. The problem with our issue was that the number of inflated locks in circulation grew very large and the cleanup that recycles them during safepoint took very long time (program threads hardly got any CPU time). The number of inflated locks in circulation is allways the high watermark - never decreases even if demand settles down. New locks are allocated if demand grows and safe-pointing can not keep-up with recycling them. Our solution was to "limit" their number so that they never grow in millions:

-XX:MonitorBound=16384

(this is soft bound, it triggers safepoint when their number grows above the bound, so their number settles a little above that)

Regards, Peter

On 04/27/2013 06:10 AM, 谢良 wrote:

Hi runtime-dev,

we hit a long STW issue on our hbase product cluster this morning.

$ java -version
java version "1.6.0_37"
Java(TM) SE Runtime Environment (build 1.6.0_37-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.12-b01, mixed mode)

per gc log :

013-04-27T01:53:20.573+0800: 908762.747: [GC 908762.747: [ParNew
Desired survivor size 286313672 bytes, new threshold 15 (max 15)
- age   1:    8829448 bytes,    8829448 total
- age   2:    6856168 bytes,   15685616 total
- age   3:    4441968 bytes,   20127584 total
- age   4:    6649776 bytes,   26777360 total
- age   5:    6648056 bytes,   33425416 total
- age   6:   10841760 bytes,   44267176 total
- age   7:    4549016 bytes,   48816192 total
- age   8:    8867144 bytes,   57683336 total
- age   9:   10832696 bytes,   68516032 total
- age  10:   17123880 bytes,   85639912 total
- age  11:    8733744 bytes,   94373656 total
- age  12:    8737944 bytes,  103111600 total
- age  13:   15025816 bytes,  118137416 total
- age  14:    8706400 bytes,  126843816 total
- age  15:    4552584 bytes,  131396400 total
: 470209K->130073K(699072K), 0.1957920 secs] 1037249K->701557K(10136256K), 0.1961640 secs] [Times: user=0.22 sys=0.00, real=0.20 secs]
Heap after GC invocations=667 (full 0):
 par new generation   total 699072K, used 130073K [0x0000000560000000, 0x00000005a0000000, 0x00000005a0000000)
  eden space 349568K,   0% used [0x0000000560000000, 0x0000000560000000, 0x0000000575560000)
  from space 349504K,  37% used [0x000000058aab0000, 0x00000005929b66e0, 0x00000005a0000000)
  to   space 349504K,   0% used [0x0000000575560000, 0x0000000575560000, 0x000000058aab0000)
 concurrent mark-sweep generation total 9437184K, used 571484K [0x00000005a0000000, 0x00000007e0000000, 0x00000007e0000000)
 concurrent-mark-sweep perm gen total 524288K, used 39683K [0x00000007e0000000, 0x0000000800000000, 0x0000000800000000)
}
Total time for which application threads were stopped: 148.6415470 seconds
Fortunately, we enabled "-XX:+PrintSafepointStatistics -XX:PrintSafepointStatisticsCount=1" options before,

         vmop                    [threads: total initially_running wait_to_block]    [time: spin block sync cleanup vmop] page_trap_count
908614.312: GenCollectForAllocation          [     328          1              2    ]      [     0     0148444     0   196    ]  0

seems we took too long for "sync" op (could any guys shed more lights about sync?),  then we found "SafepointTimeout" production option, but per my current
understanding it doesn't play on my scenario(sync took too long), just works well on spin&block op, right?

Here is the relatived code snippet:

      if (SafepointTimeout && safepoint_limit_time < os::javaTimeNanos()) {
        print_safepoint_timeout(_spinning_timeout);
      }
      ...
      if (remaining_time < 0 || Safepoint_lock->wait(true, remaining_time / MICROUNITS)) {
        print_safepoint_timeout(_blocking_timeout);
      }
      ...
      if (PrintSafepointStatistics) {
        update_statistics_on_sync_end(os::javaTimeNanos());
      }

To me seems in PRODUCT version it doesn't invoke any "print_safepoint_timeout" methods on "sync" stage  .

How could i diagnoise such long sync issue, any pointers will be highly welcome!

Thanks,
Liang