HBase >> mail # user >> RegionServers Crashing every hour in production env


Re: RegionServers Crashing every hour in production env
> That combo should be fine.

Great!!

 > If JVM is full GC'ing, the application is stopped.
 > The below does not look like a full GC but that is a long pause in system
 > time, enough to kill your zk session.

Exactly. This pause is what makes ZooKeeper expire the RS session, and the RS
then shuts down (logs at the end of this email).
But the question is: what is causing this pause?

 > You swapping?

I don't think so (stats below).

 > Hardware is good?

Yes, it is a 16-processor machine with 74GB of RAM and plenty of disk space.
Below are some metrics I have heard are worth checking. Hope they help.
** I am also having some problems with the datanodes[1], which are having
trouble writing. I really think the issues are related, but I cannot solve
either of them :(

Thanks again,
Pablo

[1]
http://mail-archives.apache.org/mod_mbox/hadoop-hdfs-user/201303.mbox/%3CCAJzooYfS-F1KS+[EMAIL PROTECTED]%3E

top - 15:38:04 up 297 days, 21:03,  2 users,  load average: 4.34, 2.55, 1.28
Tasks: 528 total,   1 running, 527 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.2%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  74187256k total, 29493992k used, 44693264k free,  5836576k buffers
Swap: 51609592k total,   128312k used, 51481280k free,  1353400k cached

]$ vmstat -w
procs -------------------memory------------------ ---swap-- -----io----
  r  b       swpd       free       buff      cache   si   so    bi    bo   in   cs  us sy  id wa st
  2  0     128312   32416928    5838288    5043560    0    0   202    53    0    0   2  1  96  1  0

]$ sar
02:20:01 PM     all     26.18      0.00      2.90      0.63      0.00     70.29
02:30:01 PM     all      1.66      0.00      1.25      1.05      0.00     96.04
02:40:01 PM     all     10.01      0.00      2.14      0.75      0.00     87.11
02:50:01 PM     all      0.76      0.00      0.80      1.03      0.00     97.40
03:00:01 PM     all      0.23      0.00      0.30      0.71      0.00     98.76
03:10:01 PM     all      0.22      0.00      0.30      0.66      0.00     98.82
03:20:01 PM     all      0.22      0.00      0.31      0.76      0.00     98.71
03:30:01 PM     all      0.24      0.00      0.31      0.64      0.00     98.81
03:40:01 PM     all      1.13      0.00      2.97      1.18      0.00     94.73
Average:        all      3.86      0.00      1.38      0.88      0.00     93.87

]$ iostat
Linux 2.6.32-220.7.1.el6.x86_64 (PSLBHDN002)     03/10/2013     _x86_64_    (16 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
            1.86    0.00    0.96    0.78    0.00   96.41

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read Blk_wrtn
sda               1.23        20.26        23.53  521533196 605566924
sdb               6.51       921.55       241.90 23717850730 6225863488
sdc               6.22       921.83       236.41 23725181162 6084471192
sdd               6.25       925.13       237.26 23810004970 6106357880
sde               6.19       913.90       235.60 23521108818 6063722504
sdh               6.26       933.08       237.77 24014594546 6119511376
sdg               6.18       914.36       235.31 23532747378 6056257016
sdf               6.24       923.66       235.33 23772251810 6056604008

Here is some more logging, which reinforces that the RS crash is caused by the
session timeout. However, this time the GC log does not show any long pause.

#####RS LOG#####
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 257739ms for sessionid
0x13d3c4bcba6014a, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 226785ms for sessionid
0x13d3c4bcba60149, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 DEBUG
org.apache.hadoop.hbase.io.hfile.LruBlockCache: Stats: total=61.91 MB,
free=1.94 GB, max=2 GB, blocks=1254, accesses=60087, hits=58811,
hitRatio=97.87%, , cachingAccesses=60069, cachingHits=58811,
cachingHitsRatio=97.90%, , evictions=0, evicted=0, evictedPerRun=NaN
2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 225100ms instead of 3000ms, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
2013-03-10 15:37:46,714 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
BP-43236042-172.17.2.10-1362490844340:blk_-6834190810033122569_25150229
java.io.EOFException: Premature EOF: no length prefix available
         at
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:171)
         at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:114)
         at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:670)
2013-03-10 15:37:46,716 ERROR
org.apache.hadoop.hbase.regionserver.HRegionServer:
org.apache.hadoop.hbase.ipc.CallerDisconnectedException: Aborting call
get([B@7caf69ed,
{"timeRange":[0,9223372036854775807],"totalColumns":1,"cacheBlocks":true,"families":{"details":["ALL"]},"maxVersions":1,"row":"\\x00\\x00\\x00\\x00\\x00\\x12\\x93@"}),
rpc version=1, client version=29, methodsFingerPrint=1891768260 from
172.17.1.71:51294 after 0 ms, since caller disconnected
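
The two pause indicators in the RS log above (the ZooKeeper "have not heard from server in Nms" message and the Sleeper "We slept Nms instead of Mms" warning) can be collected from a larger log with a short script. This is only a minimal sketch: the function name `find_pauses` and the regular expressions are my own assumptions based on the message wording in this excerpt, not anything provided by HBase or ZooKeeper.

```python
import re

# Patterns derived from the wording of the two messages in the RS log excerpt.
# They are assumptions about the message format, not an official log grammar.
SLEEPER_RE = re.compile(r"We slept (\d+)ms instead of (\d+)ms")
ZK_RE = re.compile(r"have not heard from server in (\d+)ms")


def find_pauses(log_text):
    """Return (kind, pause_ms) tuples for every pause message in log_text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    pauses = [("sleeper", int(m.group(1))) for m in SLEEPER_RE.finditer(flat)]
    pauses += [("zk", int(m.group(1))) for m in ZK_RE.finditer(flat)]
    return pauses


if __name__ == "__main__":
    sample = """\
2013-03-10 15:37:46,712 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 257739ms for sessionid
0x13d3c4bcba6014a, closing socket connection and attempting reconnect
2013-03-10 15:37:46,712 WARN org.apache.hadoop.hbase.util.Sleeper: We
slept 225100ms instead of 3000ms, this is likely due to a long garbage
collecting pause
"""
    for kind, ms in find_pauses(sample):
        print(f"{kind}: {ms / 1000:.1f}s")
```

Running it over a whole day of RS logs would show whether the long pauses recur at a regular interval, which may help correlate them with something else on the machine.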
#####GC LOG#####
2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs]
354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00,
real=0.00 secs]
2789.478: [GC 2789.478: [ParNew: 53226K->1192K(59008K), 0.0041370 secs]
353371K->301337K(1152704K), 0.0042220 secs] [Times: user=0.04 sys=0.00,
real=0.00 secs]
2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs]
353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]
2920.309: [GC 2920.309: [ParNew: 53220K->6202K(59008K), 0.0058440 secs]
353366K->310443K(1152704K), 0.0059410 secs] [Times: user=0.05 sys=0.00,
real=0
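
To double-check that none of the collections in the GC log account for the pause, the `real` time of each entry can be compared against a threshold. The sketch below assumes the GC log looks exactly like the ParNew entries pasted above; the names `parse_gc_events` and `long_pauses` and the one-second default threshold are hypothetical choices for illustration.

```python
import re

# Matches one GC entry of the form shown in the excerpt above:
#   <uptime>: [GC ... [Times: user=U sys=S, real=R secs]
# The layout is an assumption taken from this excerpt only.
GC_RE = re.compile(
    r"(\d+\.\d+): \[GC .*?"
    r"\[Times: user=(\d+\.\d+) sys=(\d+\.\d+), real=(\d+\.\d+) secs\]"
)


def parse_gc_events(log_text):
    """Return one dict per complete GC entry found in log_text."""
    # Collapse whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    events = []
    for m in GC_RE.finditer(flat):
        at, user, sys_time, real = (float(g) for g in m.groups())
        events.append({"at": at, "user": user, "sys": sys_time, "real": real})
    return events


def long_pauses(events, threshold_secs=1.0):
    """Return the events whose wall-clock (real) time reaches the threshold."""
    return [e for e in events if e["real"] >= threshold_secs]


if __name__ == "__main__":
    sample = """\
2716.635: [GC 2716.635: [ParNew: 57570K->746K(59008K), 0.0046530 secs]
354857K->300891K(1152704K), 0.0047370 secs] [Times: user=0.03 sys=0.00,
real=0.00 secs]
2868.435: [GC 2868.435: [ParNew: 53672K->740K(59008K), 0.0041570 secs]
353817K->300886K(1152704K), 0.0042440 secs] [Times: user=0.03 sys=0.00,
real=0.01 secs]
"""
    events = parse_gc_events(sample)
    print(len(events), "entries,", len(long_pauses(events)), "long pauses")
```

None of the complete entries pasted above comes anywhere near one second of `real` time, which matches what I see by eye: the pause does not show up in the GC log.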