HBase >> mail # user >> RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)


Re: RegionServer silently stops (only "issue": CMS-concurrent-mark ~80sec)


Not sure if it's related (or even helpful), but we were running cdh3b4 (which is HBase 0.90.1) and saw similar issues with region servers going down. We didn't look at the GC logs, but our ZooKeeper leases were set very high, so it's unlikely that GC caused the issue. The problem went away when we upgraded to cdh3u3, which has been rock steady in terms of region servers: we haven't had a single region server crash in a month, whereas on the older version we had about one crash every couple of days. The only other difference between the two setups is that we use Snappy on the newer one and gz on the old.
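
For anyone who does want to check the GC logs for long concurrent phases like the ~80 second CMS-concurrent-mark reported in this thread, here is a minimal Python sketch. It assumes the JVM was started with -XX:+PrintGCDetails, so that phase summaries of the form "[CMS-concurrent-mark: 79.512/80.034 secs]" appear in the log. The function name slow_cms_phases, the 10 second threshold and the sample lines are illustrative assumptions, not taken from anyone's real logs.

```python
import re

# Phase summary as printed by HotSpot with -XX:+PrintGCDetails, e.g.
# "[CMS-concurrent-mark: 79.512/80.034 secs]". The "-start" marker lines
# have no timings and are deliberately not matched.
CMS_PHASE = re.compile(r"\[(CMS-concurrent-[\w-]+): (\d+\.\d+)/(\d+\.\d+) secs\]")


def slow_cms_phases(lines, threshold_secs=10.0):
    """Return (phase, seconds) for concurrent CMS phases at or above the threshold.

    The second of the two figures is used; it is commonly read as the
    wall-clock duration of the phase (an assumption of this sketch).
    """
    slow = []
    for line in lines:
        match = CMS_PHASE.search(line)
        if match and float(match.group(3)) >= threshold_secs:
            slow.append((match.group(1), float(match.group(3))))
    return slow


if __name__ == "__main__":
    # Made-up sample lines in the usual CMS log layout.
    sample = [
        "2012-04-30T18:50:02.100+0000: 1000.100: [CMS-concurrent-mark-start]",
        "2012-04-30T18:51:22.134+0000: 1080.134: [CMS-concurrent-mark: 79.512/80.034 secs]",
        "2012-04-30T18:51:22.170+0000: 1080.170: [CMS-concurrent-preclean: 0.031/0.032 secs]",
    ]
    for phase, secs in slow_cms_phases(sample):
        print(phase, secs)
```

To run it against a real log, pass an open file object instead of the sample list, since the function only iterates over lines.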

We also noticed that having replication enabled contributed to the issues.
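
To rule out the Linux OOM killer, as suggested further down in the thread, something like the following Python sketch can scan kernel log output (for example, the output of dmesg) for kill records. It assumes the kernel writes lines containing "Killed process <pid> (<name>)"; the exact wording varies between kernel versions, and the function name oom_kills and the sample lines are made up for illustration.

```python
import re

# Kill record as written by many Linux kernels, e.g.
# "Killed process 4242 (java) total-vm:..., anon-rss:..., file-rss:...".
# The wording differs between kernel versions, so treat this as a sketch.
OOM_KILL = re.compile(r"Killed process (\d+) \(([^)]+)\)")


def oom_kills(kernel_log_lines):
    """Return a list of (pid, process_name) for each OOM kill record found."""
    kills = []
    for line in kernel_log_lines:
        match = OOM_KILL.search(line)
        if match:
            kills.append((int(match.group(1)), match.group(2)))
    return kills


if __name__ == "__main__":
    # Made-up sample of kernel log output.
    sample = [
        "[1234567.890] Out of memory: Kill process 4242 (java) score 901 or sacrifice child",
        "[1234567.891] Killed process 4242 (java) total-vm:8388608kB, anon-rss:6291456kB, file-rss:0kB",
        "[1234570.000] eth0: link up",
    ]
    for pid, name in oom_kills(sample):
        print(pid, name)
```

If a java pid shows up here around the time a region server vanished, that would explain a stop with nothing in the RS log, since the killed process gets no chance to write anything.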
------------------------------
On Tue 1 May, 2012 3:15 PM IST N Keywal wrote:

>Hi Alex,
>
>On the same idea, note that hbase is launched with
>-XX:OnOutOfMemoryError="kill -9 %p".
>
>N.
>
>On Tue, May 1, 2012 at 10:41 AM, Igal Shilman <[EMAIL PROTECTED]> wrote:
>
>> Hi Alex, just to rule out the OOM killer,
>> try this:
>>
>> http://stackoverflow.com/questions/624857/finding-which-process-was-killed-by-linux-oom-killer
>>
>>
>> On Mon, Apr 30, 2012 at 10:48 PM, Alex Baranau <[EMAIL PROTECTED]> wrote:
>>
>> > Hello,
>> >
>> > In recent weeks I have constantly seen some RSs *silently* dying on our
>> > HBase cluster. By "silently" I mean that the process stops, but there are
>> > no errors in the logs [1].
>> >
>> > The only thing I can relate to it is long CMS-concurrent-mark: almost 80
>> > seconds. But this should not cause issues as it is not a "stop-the-world"
>> > process.
>> >
>> > Any advice?
>> >
>> > HBase: hbase-0.90.4-cdh3u3
>> > Hadoop: 0.20.2-cdh3u3
>> >
>> > Thank you,
>> > Alex Baranau
>> >
>> > [1]
>> >
>> > Last lines from the RS log (no errors earlier either, and nothing written
>> > to the *.out file):
>> >
>> > 2012-04-30 18:52:11,806 DEBUG
>> > org.apache.hadoop.hbase.regionserver.CompactSplitThread: Compaction
>> > requested for agg-sa-1.3,0011|
>> >
>> >
>> te|dtc|\x00\x00\x00\x00\x00\x00<\x1E\x002\x00\x00\x00\x015\x9C_n\x00\x00\x00\x00\x00\x00\x00\x00\x00,1334852280902.4285f9339b520ee617c087c0fd0dbf65.
>> > because regionserver60020.cacheFlusher; priority=-1, compaction queue
>> > size=0
>> > 2012-04-30 18:54:58,779 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: using new createWriter -- HADOOP-6840
>> > 2012-04-30 18:54:58,779 DEBUG org.apache.hadoop.hbase.regionserver.wal.SequenceFileLogWriter: Path=hdfs://xxx.ec2.internal/hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651, syncFs=true, hflush=false
>> > 2012-04-30 18:54:58,874 INFO org.apache.hadoop.hbase.regionserver.wal.HLog: Roll /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335811856672, entries=73789, filesize=63773934. New hlog /hbase/.logs/xxx.ec2.internal,60020,1335706613397/xxx.ec2.internal%3A60020.1335812098651
>> > 2012-04-30 18:56:31,867 INFO org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush thread woke up with memory above low water.
>> > 2012-04-30 18:56:31,867 INFO
>> > org.apache.hadoop.hbase.regionserver.MemStoreFlusher: Flush of region
>> > agg-sa-1.3,s_00I4|
>> >
>> >
>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.
>> > due to global heap pressure
>> > 2012-04-30 18:56:31,867 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion:
>> > Started memstore flush for agg-sa-1.3,s_00I4|
>> >
>> >
>> tdqc\x00docs|mrtdocs|\x00\x00\x00\x00\x00\x03\x11\xF4\x00none\x00|1334692562\x00\x0D\xE0\xB6\xB3\xA7c\xFF\xBC|26837373\x00\x00\x00\x016\xC1\xE0D\xBE\x00\x00\x00\x00\x00\x00\x00\x00,1335761291026.30b127193485342359eadf1586819805.,
>> > current region memstore size 138.1m
>> > 2012-04-30 18:56:31,867 DEBUG
>> org.apache.hadoop.hbase.regionserver.HRegion: