HBase user mailing list - Region servers going down under heavy write load


Re: Region servers going down under heavy write load
Ameya Kantikar 2013-06-05, 21:34
In zoo.cfg I have not set this value explicitly. My zoo.cfg looks like:

tickTime=2000
initLimit=10
syncLimit=5

We use a common ZooKeeper cluster for 2 of our HBase clusters. I'll try
increasing this value in zoo.cfg.
However, is it possible to set this value per cluster?
I thought this property in hbase-site.xml takes care of that:
zookeeper.session.timeout
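
For reference, a minimal sketch of how the two settings interact (values are
illustrative; the maxSessionTimeout line is an assumed addition, not something
already in the configs quoted in this thread): hbase-site.xml lets each HBase
cluster request its own session timeout, but the ZooKeeper ensemble caps
whatever a client requests at maxSessionTimeout, which defaults to 20 *
tickTime in zoo.cfg. With tickTime=2000 that cap is 40000 ms, so a 300000 ms
request would be negotiated down to 40 s unless the cap is raised on the
shared ensemble.

# zoo.cfg on the shared ZooKeeper ensemble (affects both HBase clusters)
tickTime=2000
# ZooKeeper caps client session timeouts at maxSessionTimeout,
# which defaults to 20 * tickTime (40000 ms here); assumed addition
maxSessionTimeout=300000

<!-- hbase-site.xml, set per HBase cluster -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>300000</value>
</property>

So zookeeper.session.timeout can differ per cluster, but a shared ensemble
enforces the same upper bound on all of them.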
On Wed, Jun 5, 2013 at 1:49 PM, Kevin O'Dell <[EMAIL PROTECTED]> wrote:

> Ameya,
>
>   What does your zoo.cfg say for your timeout value?
>
>
> On Wed, Jun 5, 2013 at 4:47 PM, Ameya Kantikar <[EMAIL PROTECTED]> wrote:
>
> > Hi,
> >
> > We have heavy map reduce write jobs running against our cluster. Every once
> > in a while, we see a region server going down.
> >
> > We are on: 0.94.2-cdh4.2.0, r
> >
> > We have done some tuning for heavy map reduce jobs: we have increased
> > scanner timeouts and lease timeouts, and have also tuned the memstore as follows:
> >
> > hbase.hregion.memstore.block.multiplier: 4
> > hbase.hregion.memstore.flush.size: 134217728
> > hbase.hstore.blockingStoreFiles: 100
> >
> > So now, we are still facing issues. Looking at the logs, it appears to be due
> > to a ZooKeeper timeout. We have tuned ZooKeeper settings as follows in
> > hbase-site.xml:
> >
> > zookeeper.session.timeout: 300000
> > hbase.zookeeper.property.tickTime: 6000
> >
> >
> > The actual log looks like:
> >
> >
> > 2013-06-05 11:46:40,405 WARN org.apache.hadoop.ipc.HBaseServer:
> > (responseTooSlow): {"processingtimems":13468,"call":"next(6723331143689528698, 1000),
> > rpc version=1, client version=29, methodsFingerPrint=54742778","client":"10.20.73.65:41721",
> > "starttimems":1370432786933,"queuetimems":1,"class":"HRegionServer","responsesize":39611416,"method":"next"}
> >
> > 2013-06-05 11:46:54,988 INFO org.apache.hadoop.io.compress.CodecPool: Got
> > brand-new decompressor [.snappy]
> >
> > 2013-06-05 11:48:03,017 WARN org.apache.hadoop.hdfs.DFSClient:
> > DFSOutputStream ResponseProcessor exception for block
> > BP-53741567-10.20.73.56-1351630463427:blk_9026156240355850298_8775246
> > java.io.EOFException: Premature EOF: no length prefix available
> >         at org.apache.hadoop.hdfs.protocol.HdfsProtoUtil.vintPrefixed(HdfsProtoUtil.java:162)
> >         at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:95)
> >         at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:656)
> >
> > 2013-06-05 11:48:03,020 WARN org.apache.hadoop.hbase.util.Sleeper: *We
> > slept 48686ms instead of 3000ms*, this is likely due to a long garbage
> > collecting pause and it's usually bad, see
> > http://hbase.apache.org/book.html#trouble.rs.runtime.zkexpired
> >
> > 2013-06-05 11:48:03,094 FATAL
> > org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> > smartdeals-hbase14-snc1.snc1,60020,1370373396890: Unhandled exception:
> > org.apache.hadoop.hbase.YouAreDeadException: Server REPORT rejected;
> > currently processing smartdeals-hbase14-snc1.snc1,60020,1370373396890 as
> > dead server
> >
> > (Not sure why it says 3000ms when we have the timeout at 300000ms.)
> >
> > We have done some GC tuning as well. Wondering what I can tune to keep the
> > RS from going down? Any ideas?
> > This is a batch-heavy cluster, and we care less about read latency. We can
> > increase RAM a bit more, but not much (the RS already has 20GB of memory).
> >
> > Thanks in advance.
> >
> > Ameya
> >
>
>
>
> --
> Kevin O'Dell
> Systems Engineer, Cloudera
>
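
For anyone hitting the same pattern: the "We slept 48686ms instead of 3000ms"
warning followed by YouAreDeadException means the region server JVM paused
long enough for its ZooKeeper session to expire, so the master marked it dead.
As a rough illustration only (heap size and flag values are assumptions, not
recommendations from this thread), 0.94-era region servers were commonly run
with CMS and GC logging enabled in hbase-env.sh, along these lines:

# hbase-env.sh -- illustrative GC settings for a ~20 GB region server heap
export HBASE_REGIONSERVER_OPTS="$HBASE_REGIONSERVER_OPTS \
  -Xms20g -Xmx20g \
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC \
  -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly \
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps \
  -Xloggc:/var/log/hbase/gc-regionserver.log"

The resulting GC log shows whether a pause like the 48 s one above was a full
GC (e.g. a promotion or concurrent mode failure), which is the usual reason a
write-heavy region server loses its ZooKeeper session.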