HBase, mail # user - hbase region server shutdown after datanode connection exception


Re: hbase region server shutdown after datanode connection exception
Jean-Daniel Cryans 2013-05-23, 16:52
You are looking at it the wrong way. Per
http://hbase.apache.org/book.html#trouble.general, always walk up the
log to the first exception. In this case it's a session timeout.
Whatever happens next is most probably a side effect of that.
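For illustration, a minimal sketch of "walking up the log" (the log path and
file name pattern below are assumptions and will differ per install):

    # list the earliest WARN/ERROR entries in the region server log, oldest first;
    # the first one, not the last, usually names the real cause
    grep -nE ' (WARN|ERROR) ' /var/log/hbase/hbase-*-regionserver-*.log | head -n 20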

To help debug your issue, I would suggest reading this section of the
reference guide: http://hbase.apache.org/book.html#trouble.rs.runtime
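That section points at the usual cause of a ZooKeeper session expiration: a
long GC pause (or swapping) on the region server host. A rough way to check
for that (a sketch; the exact message wording and log path vary by HBase
version and install):

    # long GC pauses show up in the region server log as
    # "We slept Xms instead of Yms, this is likely due to a long garbage collecting pause ..."
    grep -n "slept" /var/log/hbase/hbase-*-regionserver-*.log | tail -n 20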

J-D

On Tue, May 21, 2013 at 7:17 PM, Cheng Su <[EMAIL PROTECTED]> wrote:
> Hi all.
>
>
>
>          I have a small hbase cluster with 3 physical machines.
>
>          On 192.168.1.80, there are the HMaster and a region server. On 81 &
> 82, there is one region server on each.
>
>          The region server on 80 could not sync its HLog after a datanode
> access exception, and started to shut down.
>
>          The datanode itself was not shut down and responded to other
> requests normally. I'll paste the logs below.
>
>          My question is:
>
>          1. Why does this exception cause a region server shutdown? Can I
> prevent it?
>
>          2. Are there any tools (a shell command is best, like hadoop dfsadmin
> -report) that can monitor an HBase region server, i.e. check whether it is
> alive or dead?
>
>            I have done some research and found that Nagios/Ganglia can do
> such things. But I actually just want to know whether the region server is
> alive or dead, so they are a little overqualified.
>
>            And I'm not using CDH, so I don't think I can use Cloudera Manager.
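
For a bare-bones liveness check without Nagios/Ganglia, the HBase shell's
"status" command can be scripted; a sketch (the summary format may vary
slightly by version):

    # prints a one-line summary such as "3 servers, 0 dead, 2.0000 average load";
    # a non-zero dead count means at least one region server is down
    echo "status" | hbase shell
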
>
>
>
>          Here are the logs.
>
>
>
>          HBase master:
> 2013-05-21 17:03:32,675 ERROR org.apache.hadoop.hbase.master.HMaster: Region
> server hadoop01,60020,1368774173179 reported a fatal error:
>
> ABORTING region server hadoop01,60020,1368774173179:
> regionserver:60020-0x3eb14c67540002 regionserver:60020-0x3eb14c67540002
> received expired from ZooKeeper, aborting
>
> Cause:
>
> org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:369)
>         at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:266)
>         at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
>         at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
>
>
>
>          Region Server:
>
> 2013-05-21 17:00:16,895 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x3eb14c67540002, closing socket connection and attempting reconnect
>
> 2013-05-21 17:00:35,896 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 120000ms for sessionid 0x13eb14ca4bb0000, closing socket connection and attempting reconnect
>
> 2013-05-21 17:03:31,498 WARN org.apache.hadoop.hdfs.DFSClient: DFSOutputStream ResponseProcessor exception for block blk_9188414668950016309_4925046
> java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/192.168.1.80:57020 remote=/192.168.1.82:50010]
>
>         at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
>         at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
>         at java.io.DataInputStream.readFully(DataInputStream.java:178)
>         at java.io.DataInputStream.readLong(DataInputStream.java:399)
>         at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:124)
>         at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:2784)
>
>
>
> 2013-05-21 17:03:31,520 WARN org.apache.hadoop.hdfs.DFSClient: Error Recovery for block blk_9188414668950016309_4925046 bad datanode[0] 192.168.1.82:50010