Region servers going down under heavy write load
Hi,

We have heavy MapReduce write jobs running against our cluster. Every
once in a while, we see a region server going down.

We are on: 0.94.2-cdh4.2.0, r

We have done some tuning for heavy MapReduce jobs: we have increased
scanner timeouts and lease timeouts, and have also tuned the memstore
as follows:

hbase.hregion.memstore.block.multiplier: 4
hbase.hregion.memstore.flush.size: 134217728
hbase.hstore.blockingStoreFiles: 100
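
If I understand the block multiplier correctly, writes to a region are
blocked once its memstore reaches hbase.hregion.memstore.flush.size x
hbase.hregion.memstore.block.multiplier, i.e. 134217728 x 4 =
536870912 bytes (512 MB) per region.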

Even so, we are still facing issues. Looking at the logs, it appears
to be due to a ZooKeeper timeout. We have tuned the ZooKeeper settings
as follows in hbase-site.xml:

zookeeper.session.timeout: 300000
hbase.zookeeper.property.tickTime: 6000
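
Spelled out in hbase-site.xml property syntax (same values as above;
the memstore settings earlier are written the same way), that is:

<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>
<property>
  <name>hbase.zookeeper.property.tickTime</name>
  <value>6000</value>
</property>
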
The actual log looks like:
2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer:
(responseTooSlow):
{"processingtimems":13468,"call":"next(6723331143689528698, 1000), rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.73.65:41721","starttimems":1370432786933,"queuetimems":1,"class":"HRegionServer","responsesize":39611416,"method":"next"}

2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool: Got
brand-new decompressor [.snappy]

2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient:
DFSOutputStream ResponseProcessor exception  for block
BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
java.io.EOFException: Premature EOF: no length prefix available
        at
org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
        at
org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
        at
org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)

2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper: *We
slept 48686ms instead of 3000ms*, this is likely due to a long garbage
collecting pause and it's usually bad, see
http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired

2013-06-05 11:48:03,094 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled exception:
org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
currently processing smartdeals-hbase14-snc1.snc1,60020,1370373396890 as
dead server

(Not sure why it says 3000ms when we have the timeout set to 300000ms.)

We have done some GC tuning as well. Wondering what I can tune to keep
the region servers from going down? Any ideas?
This is a batch-heavy cluster, and we care less about read latency. We
can increase RAM a bit more, but not much (the RS already has 20GB of
memory).
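
For reference, the GC options for the region server JVM live in
hbase-env.sh, roughly along these lines (the 20g heap is just inferred
from the 20GB figure above, and the CMS flags are typical examples
rather than our exact settings):

# Illustrative sketch only: heap size, flags and log path are placeholders
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms20g -Xmx20g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 \
  -XX:+CMSParallelRemarkEnabled \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"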

Thanks in advance.

Ameya

Replies:
Kevin Odell 2013-06-05, 20:49
Ameya Kantikar 2013-06-05, 21:34
Ted Yu 2013-06-05, 21:45
Ameya Kantikar 2013-06-05, 22:18
Ameya Kantikar 2013-06-06, 00:45
Ted Yu 2013-06-06, 02:57
Stack 2013-06-06, 06:15
Stack 2013-06-06, 06:21
Ted Yu 2013-06-06, 16:33