HBase, mail # user - Occasional regionserver crashes following socket errors writing to HDFS


Re: Occasional regionserver crashes following socket errors writing to HDFS
Michael Segel 2012-05-24, 12:13
http://wiki.apache.org/hadoop/Hbase/Troubleshooting#A8
  <property>
    <name>zookeeper.session.timeout</name>
    <value>1200000</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.tickTime</name>
    <value>6000</value>
  </property>
The default is 60 seconds, which you reduced to 20 (assuming this is the right parameter).
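
Keep in mind that a ZooKeeper server does not necessarily grant the session timeout a client asks for. Assuming the server runs with its default minSessionTimeout and maxSessionTimeout (2 and 20 times tickTime), the requested value is clamped into that range. Below is a minimal Python sketch of that arithmetic using the two values from the wiki snippet above. The function name is made up for illustration and is not part of HBase or ZooKeeper.

```python
def negotiated_session_timeout(requested_ms, tick_ms, min_ms=None, max_ms=None):
    """Clamp a requested session timeout into the server's allowed range.

    Assumption: when min_ms / max_ms are not given, the server uses the
    ZooKeeper defaults of 2 * tickTime and 20 * tickTime.
    """
    lo = 2 * tick_ms if min_ms is None else min_ms
    hi = 20 * tick_ms if max_ms is None else max_ms
    return max(lo, min(requested_ms, hi))


# Values from the wiki snippet: timeout 1200000 ms, tickTime 6000 ms.
print(negotiated_session_timeout(1200000, 6000))
```

Under those assumptions the result is 120000 ms, so with a 6000 ms tickTime the effective timeout would be two minutes rather than the requested twenty. It is worth checking what your servers actually have configured before relying on the number in hbase-site.xml.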

As you said, you were running a major compaction at the time.
On May 24, 2012, at 6:15 AM, Eran Kutner wrote:

> Thanks, Stack, for noticing the ZooKeeper timeout; I don't know how I could
> have missed that.
>
> After analyzing this for a while it is definitely unrelated to GC. In fact
> during the last 4 days no GC operation took more than 2 seconds, and those
> that got close were all concurrent mark sweeps, so they should not be
> stopping other threads.
>
> These are the interesting log lines:
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 23706ms for sessionid
> 0x1372aa57bee0308, closing socket connection and attempting reconnect
> 2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
> session timed out, have not heard from server in 24638ms for sessionid
> 0x3372bf3891304bf, closing socket connection and attempting reconnect
> 2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening
> socket connection to server hadoop1-zk1/10.1.104.201:2181
> 2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket
> connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
> 2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to
> reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired,
> closing socket connection
> 2012-05-22 01:25:12,081 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475,
> load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983):
> regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf
> received expired from ZooKeeper, aborting
> org.apache.zookeeper.KeeperException$SessionExpiredException:
> KeeperErrorCode = Session expired
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
>        at
> org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
>        at
> org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)
>
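
To see how long each session went without hearing from the server, the "have not heard from server" lines can be pulled out mechanically. This is only a rough Python sketch: the two log lines are copied from the quoted client log above and rejoined after mail wrapping, and the regular expression and function name are my own, not anything shipped with ZooKeeper.

```python
import re

# Two client-side lines from the quoted region server log, unwrapped.
LOG = """\
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 23706ms for sessionid 0x1372aa57bee0308, closing socket connection and attempting reconnect
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 24638ms for sessionid 0x3372bf3891304bf, closing socket connection and attempting reconnect
"""

PATTERN = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-f]+)"
)


def session_gaps(text):
    """Return (session id, silent milliseconds, session id as decimal) per match."""
    return [(sid, int(ms), int(sid, 16)) for ms, sid in PATTERN.findall(text)]


for sid, ms, dec in session_gaps(LOG):
    print(sid, ms, dec)
```

The decimal form is included because the ZooKeeper server log prints the session that way in places: 0x3372bf3891304bf is 231702230809642175, which is the value in the server's "Revalidating client" line, so that line refers to the same session that later expired.
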
> This is what the zookeeper logs show at the same time:
> 2012-05-22 01:24:46,014 - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
> read additional data from client sessionid 0x1372aa57bef6611, likely client
> has closed socket
> 2012-05-22 01:24:46,014 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
> client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
> 2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] -
> Unexpected Exception:
> 2012-05-22 01:25:08,016 - INFO  [CommitProcessor:1:NIOServerCnxn@1435] -
> Closed socket connection for client /10.1.104.5:33945 which had sessionid
> 0x1372aa57bee0308
> 2012-05-22 01:25:12,046 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
> connection from /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
> session 0x3372bf3891304bf at /10.1.104.5:43070
> 2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
> 2012-05-22 01:25:12,077 - INFO
> [QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session
> 0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
> 2012-05-22 01:25:12,078 - INFO  [NIOServerCxn.Factory: