HBase, mail # user - Occasional regionserver crashes following socket errors writing to HDFS


Re: Occasional regionserver crashes following socket errors writing to HDFS
Eran Kutner 2012-05-24, 11:15
Thanks, Stack, for noticing the ZooKeeper timeout. I don't know how I could
have missed that.

After analyzing this for a while, it is definitely unrelated to GC. In fact,
during the last 4 days no GC operation took more than 2 seconds, and those
that got close were all concurrent mark-sweep collections, so they should
not have been stopping other threads.

These are the interesting log lines:
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 23706ms for sessionid
0x1372aa57bee0308, closing socket connection and attempting reconnect
2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client
session timed out, have not heard from server in 24638ms for sessionid
0x3372bf3891304bf, closing socket connection and attempting reconnect
2012-05-22 01:25:12,047 INFO org.apache.zookeeper.ClientCnxn: Opening
socket connection to server hadoop1-zk1/10.1.104.201:2181
2012-05-22 01:25:12,048 INFO org.apache.zookeeper.ClientCnxn: Socket
connection established to hadoop1-zk1/10.1.104.201:2181, initiating session
2012-05-22 01:25:12,080 INFO org.apache.zookeeper.ClientCnxn: Unable to
reconnect to ZooKeeper service, session 0x3372bf3891304bf has expired,
closing socket connection
2012-05-22 01:25:12,081 FATAL
org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
serverName=hadoop1-s05.farm-ny.gigya.com,60020,1336990798475,
load=(requests=4015, regions=708, usedHeap=2342, maxHeap=7983):
regionserver:60020-0x3372bf3891304bf regionserver:60020-0x3372bf3891304bf
received expired from ZooKeeper, aborting
org.apache.zookeeper.KeeperException$SessionExpiredException:
KeeperErrorCode = Session expired
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:352)
        at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:270)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:531)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:507)

This is what the ZooKeeper logs show at the same time:
2012-05-22 01:24:46,014 - WARN  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@634] - EndOfStreamException: Unable to
read additional data from client sessionid 0x1372aa57bef6611, likely client
has closed socket
2012-05-22 01:24:46,014 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.4:57598 which had sessionid 0x1372aa57bef6611
2012-05-22 01:25:08,010 - ERROR [CommitProcessor:1:NIOServerCnxn@445] -
Unexpected Exception:
2012-05-22 01:25:08,016 - INFO  [CommitProcessor:1:NIOServerCnxn@1435] -
Closed socket connection for client /10.1.104.5:33945 which had sessionid
0x1372aa57bee0308
2012-05-22 01:25:12,046 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn$Factory@251] - Accepted socket
connection from /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@770] - Client attempting to renew
session 0x3372bf3891304bf at /10.1.104.5:43070
2012-05-22 01:25:12,076 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:Learner@103] - Revalidating client: 231702230809642175
2012-05-22 01:25:12,077 - INFO
[QuorumPeer:/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1573] - Invalid session
0x3372bf3891304bf for client /10.1.104.5:43070, probably expired
2012-05-22 01:25:12,078 - INFO  [NIOServerCxn.Factory:
0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for
client /10.1.104.5:43070 which had sessionid 0x3372bf3891304bf
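
When matching lines across the two logs, note that the "Revalidating client"
line appears to print the session id in decimal, while every other line prints
it in hex. A minimal Python sketch of the conversion follows; the helper and
constant names are made up for illustration, and only the two values are taken
from the logs above.

```python
# Values copied from the logs in this message.
EXPIRED_SESSION = "0x3372bf3891304bf"      # hex form, as in "Invalid session ..."
REVALIDATED_CLIENT = 231702230809642175    # decimal form, as in "Revalidating client: ..."


def session_id_to_decimal(hex_id: str) -> int:
    """Convert a hex session id such as '0x3372bf3891304bf' to an integer."""
    return int(hex_id, 16)


def decimal_to_session_id(value: int) -> str:
    """Convert a decimal session id back to the lowercase hex form used in the logs."""
    return hex(value)


if __name__ == "__main__":
    same = session_id_to_decimal(EXPIRED_SESSION) == REVALIDATED_CLIENT
    print("Revalidated client is the expired session:", same)
```

With this, the decimal id can be rewritten as hex and searched for in the
regionserver log, or the other way around.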

I have zookeeper.session.timeout set to 20 seconds because I wanted quick
recovery in case of a failure.
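
To put numbers on it, the silence reported by the client can be compared with
that 20-second setting. The sketch below is only an illustration: the function
names are hypothetical, the regular expression assumes the ClientCnxn message
format exactly as it appears in the log lines above, and the two log lines are
joined back onto single lines.

```python
import re

# Session timeout in milliseconds, as stated in this message (20 seconds).
SESSION_TIMEOUT_MS = 20_000

# The two client timeout lines from the regionserver log, unwrapped.
LOG_LINES = [
    "2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client "
    "session timed out, have not heard from server in 23706ms for sessionid "
    "0x1372aa57bee0308, closing socket connection and attempting reconnect",
    "2012-05-22 01:25:11,502 INFO org.apache.zookeeper.ClientCnxn: Client "
    "session timed out, have not heard from server in 24638ms for sessionid "
    "0x3372bf3891304bf, closing socket connection and attempting reconnect",
]

# Assumed message format; adjust if your ZooKeeper client logs differently.
TIMEOUT_RE = re.compile(
    r"have not heard from server in (\d+)ms for sessionid (0x[0-9a-f]+)"
)


def silent_periods(lines):
    """Map each session id to the silence (in ms) reported for it."""
    periods = {}
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            periods[match.group(2)] = int(match.group(1))
    return periods


def overshoot(lines, timeout_ms=SESSION_TIMEOUT_MS):
    """Map each session id to how far (in ms) its silence exceeded the timeout."""
    return {sid: ms - timeout_ms for sid, ms in silent_periods(lines).items()}


if __name__ == "__main__":
    for sid, extra in overshoot(LOG_LINES).items():
        print(f"{sid}: silent for {extra} ms longer than the configured timeout")
```

Both sessions were silent for several seconds longer than the configured
timeout, so this was not a borderline miss.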

Any idea why it would not respond within 20 seconds? That seems like quite a
lot of time.
I don't know whether it's related, but a major compaction was running when
this error occurred.

Thanks.

-eran

On Fri, May 11, 2012 at 2:36 PM, Michael Segel <[EMAIL PROTECTED]> wrote:

> So I see you're looking at Eran's problem.... ;-)