Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
HBase >> mail # user >> Occasional regionserver crashes following socket errors writing to HDFS


Copy link to this message
-
Re: Occasional regionserver crashes following socket errors writing to HDFS
Thanks Stack for noticing the ZooKeeper timeout, don't know how could I
have missed that.

After analyzing this for a while it is definitely unrelated to GC. In fact
during the last 4 days no GC operation took more than 2 seconds, and those
that got close were all concurrent mark sweeps, so they should not be
stopping other threads.

These are the interesting log lines:
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 23706ms for sessionid
0x1372aa57bee0308, closing socket connection and attempting reconnect
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 24638ms for sessionid
0x3372bf3891304bf, closing socket connection and attempting reconnect
2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop1-zk1/10.1.104.201:2181
2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired,
closing socket connection
2012-05-22 01:25:12,081 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475,
load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983):
regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at
org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at
org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
        at
org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)

This is what the zookeeper logs show at the same time:
2012-05-22 01:24:46,014 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
read additional data from client sessionid 0x1372aa57bef6611, likely client
has closed socket
2012-05-22 01:24:46,014 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] -
Unexpected Exception:
2012-05-22 01:25:08,016 - INFO  [CommitProcessor:1:NIOServerCnxn@1435] -
Closed socket connection for client /10.1.104.5:33945 which had sessionid
0x1372aa57bee0308
2012-05-22 01:25:12,046 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
session 0x3372bf3891304bf at /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
2012-05-22 01:25:12,077 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session
0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
2012-05-22 01:25:12,078 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.5:43070 which had sessionid 0x3372bf3891304bf

I have zookeeper.session.timeout set to 20 seconds because I wanted quick
recovery in case of a failure.

Any idea why it would not respond in 20 seconds? Seems like quite a lot of
time.
Don't know if it's related or not but major compaction was happening while
this error occurred.

Thanks.

-eran

On Fri, May 11, 2012 at 2:36 PM, Michael Segel <[EMAIL PROTECTED]>wrote:

> So I see you're looking at Eran's problem.... ;-)
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB