HBase, mail # user - RegionServers Crashing every hour in production env


RegionServers Crashing every hour in production env
Pablo Musa 2013-03-08, 15:44
Hey guys,
as I mentioned in an email a while ago, the RSs in my cluster were not
behaving and crashed 3 times a day. I tried a lot of the options we
discussed in those emails, but none of them solved the problem. Since I
was using an old version of Hadoop, I thought that was the cause.

So, I upgraded from hadoop 0.20 / hbase 0.90 / zookeeper 3.3.5 to
hadoop 2.0.0 / hbase 0.94 / zookeeper 3.4.5.

Unfortunately the RSs did not stop crashing, and worse: now they crash
every hour, and sometimes when the RS that holds the .ROOT. region
crashes, the whole cluster gets stuck in transition and everything stops
working. When that happens I need to clean the zookeeper znodes and
restart the master and the RSs.
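For what it's worth, the cleanup I do looks roughly like the sketch
below. The script paths and the /hbase/unassigned znode are assumptions
(default /hbase chroot, stock package locations), and both commands are
overridable so the sequence can be dry-run:

```shell
# Hypothetical recovery sequence for the "stuck in transition" case.
# ZK_CLI and HBASE_DAEMON are placeholders for the real script paths.
ZK_CLI=${ZK_CLI:-/usr/lib/zookeeper/bin/zkCli.sh}
HBASE_DAEMON=${HBASE_DAEMON:-/usr/lib/hbase/bin/hbase-daemon.sh}

recover_cluster() {
    # drop the stale region-in-transition znodes left by the dead RS
    $ZK_CLI -server localhost:2181 rmr /hbase/unassigned
    # then bounce the master and the RS so regions get reassigned
    $HBASE_DAEMON restart master
    $HBASE_DAEMON restart regionserver
}
```

Obviously the znode path depends on your zookeeper.znode.parent setting.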
To avoid that scenario I am running production with only ONE RS and a
monitoring script that checks every minute whether the RS is ok and
restarts it if not.
* This setup does not get the cluster stuck.
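The watchdog is nothing fancy; a minimal sketch of it is below. The URL
and restart command are placeholders (60030 is the default RS info port
in 0.94, and /rs-status is the RS status page):

```shell
#!/bin/sh
# Hypothetical watchdog, run from cron every minute: probe the RS info
# port and restart the daemon if the probe fails.
RS_URL=${RS_URL:-http://localhost:60030/rs-status}
RESTART_CMD=${RESTART_CMD:-/usr/lib/hbase/bin/hbase-daemon.sh restart regionserver}

check_and_restart() {
    # -s: quiet, -f: fail on HTTP errors, --max-time: don't hang forever
    if curl -sf --max-time 10 "$RS_URL" >/dev/null 2>&1; then
        echo "regionserver ok"
    else
        echo "regionserver down, restarting"
        $RESTART_CMD
    fi
}
```

A crontab entry like `* * * * * /usr/local/bin/rs-watchdog.sh` would
invoke it every minute.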

This is driving me crazy, and I really can't find a solution for the
cluster. I tracked all the logs from the start time (16:49) on every
interesting node (zoo, namenode, master, rs, dn2, dn9, dn10) and copied
below what I think is useful.

There are some strange errors on DATANODE2, such as an error copying a
block to itself.

The GC log points to a GC timeout. However, it is very odd that the RS
spends so much time in GC when in the other cases it takes 0.001 sec.
Besides, the time is spent in sys, which makes me think the problem
might be somewhere else.
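In case anyone wants to reproduce what I'm seeing: this is roughly the
filter I use to pull the long pauses out of the GC log. It assumes
HotSpot "[Times: user=... sys=... real=N secs]" lines, as produced with
-verbose:gc -XX:+PrintGCDetails; the 10-second threshold is arbitrary:

```shell
# Hypothetical filter: print stop-the-world pauses longer than 10 seconds
# from a HotSpot GC log. Reads the log on stdin, e.g.:
#   long_gc_pauses < gc.log
long_gc_pauses() {
    # split each matching line on "real="; the first word after it is the
    # pause duration in seconds
    awk -F'real=' '/real=/ { split($2, a, " "); if (a[1] + 0 > 10) print }'
}
```

Comparing user vs sys on the lines it prints is what made me suspect
something outside the JVM.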

I know this is a pile of logs, and that it is very difficult to find
the problem without more context. But I REALLY need some help. If not
the solution, then at least what I should read, where I should look, or
what I should monitor.

Thank you very much,
Pablo Musa

Start RS at 16:49
####ZOOKEEPER-PSLBHNN001####
2013-03-07 16:49:04,369 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
Accepted socket connection from /172.17.2.18:33549
2013-03-07 16:49:04,372 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client
attempting to establish new session at /172.17.2.18:33549
2013-03-07 16:49:04,373 [myid:] - INFO
[SyncThread:0:ZooKeeperServer@595] - Established session
0x13d3c4bcba600a7 with negotiated timeout 150000 for client
/172.17.2.18:33549
2013-03-07 16:49:04,458 [myid:] - INFO  [ProcessThread(sid:0
cport:-1)::PrepRequestProcessor@627] - Got user-level KeeperException
when processing sessionid:0x13d3c4bcba600a7 type:create cxid:0x8
zxid:0x11010ade3a txntype:-1 reqpath:n/a Error
Path:/hbase/online-snapshot/acquired Error:KeeperErrorCode = NodeExists
for /hbase/online-snapshot/acquired
2013-03-07 16:49:09,036 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
Accepted socket connection from /172.17.2.18:33675
2013-03-07 16:49:09,036 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client
attempting to establish new session at /172.17.2.18:33675
2013-03-07 16:49:09,041 [myid:] - INFO
[SyncThread:0:ZooKeeperServer@595] - Established session
0x13d3c4bcba600a8 with negotiated timeout 150000 for client
/172.17.2.18:33675
2013-03-07 17:24:50,001 [myid:] - INFO
[SessionTracker:ZooKeeperServer@325] - Expiring session
0x13d3c4bcba600a7, timeout of 150000ms exceeded
2013-03-07 17:24:50,001 [myid:] - INFO  [ProcessThread(sid:0
cport:-1)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x13d3c4bcba600a7
2013-03-07 17:24:50,002 [myid:] - INFO [SyncThread:0:NIOServerCnxn@1001]
- Closed socket connection for client /172.17.2.18:33549 which had
sessionid 0x13d3c4bcba600a7
2013-03-07 17:24:57,889 [myid:] - WARN
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@349] - caught
end of stream exception
EndOfStreamException: Unable to read additional data from client
sessionid 0x13d3c4bcba600a8, likely client has closed socket
         at
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:220)
         at
org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
         at java.lang.Thread.run(Thread.java:722)
2013-03-07 17:24:57,890 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed
socket connection for client /172.17.2.18:33675 which had sessionid
0x13d3c4bcba600a8
2013-03-07 17:25:00,001 [myid:] - INFO
[SessionTracker:ZooKeeperServer@325] - Expiring session
0x13d3c4bcba600a8, timeout of 150000ms exceeded
2013-03-07 17:25:00,001 [myid:] - INFO  [ProcessThread(sid:0
cport:-1)::PrepRequestProcessor@476] - Processed session termination for
sessionid: 0x13d3c4bcba600a8
2013-03-07 17:26:03,211 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] -
Accepted socket connection from /172.17.2.18:35983
2013-03-07 17:26:03,214 [myid:] - INFO
[NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@839] - Client
attempting to establish new session at /172.17.2.18:35983
2013-03-07 17:26:03,215 [myid:] - INFO
[SyncThread:0:ZooKeeperServer@595] - Established session
0x13d3c4bcba600a9 with negotiated timeout 150000 for client
/172.17.2.18:35983

####MASTER-PSLBHNN001####
2013-03-07 16:49:00,379 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
unassigned = 1
2013-03-07 16:49:04,517 DEBUG
org.apache.hadoop.hbase.master.ServerManager: STARTUP: Server
PSLBHDN002,60020,1362685744154 came back up, removed it from the dead
servers list
2013-03-07 16:49:04,517 INFO
org.apache.hadoop.hbase.master.ServerManager: Registering
server=PSLBHDN002,60020,1362685744154
2013-03-07 16:49:05,380 DEBUG
org.apache.hadoop.hbase.master.SplitLogManager: total tasks = 1
unassigned = 1
2013-03-07 16:49:05,483 INFO
org.apache.hadoop.hbase.master.SplitLogManager: task
/hbase/splitlog/hdfs%3A%2F%2Fpslbhnn001.psafe.net%2Fhbase%2F.logs%2FPSLBHDN002%2C60020%2C1362683643918-splitting%2FPSLBHDN002%252C60020%252C1362683643918.1362683644662
acquired by PSLBHDN002,60020,1362685744154
2013-03-07 16:49
Replies:
+ Ted Yu 2013-03-08, 16:01
+ ramkrishna vasudevan 2013-03-08, 16:32
+ Stack 2013-03-08, 17:11
+ Pablo Musa 2013-03-08, 18:58
+ Stack 2013-03-08, 22:02
+ Pablo Musa 2013-03-10, 18:59
+ Sreepathi 2013-03-10, 19:06
+ Pablo Musa 2013-03-10, 22:29
+ Stack 2013-03-10, 22:41
+ Azuryy Yu 2013-03-11, 02:13
+ Andrew Purtell 2013-03-11, 02:24
+ Pablo Musa 2013-03-12, 15:43
+ Pablo Musa 2013-04-03, 18:21
+ Ted Yu 2013-04-03, 18:36
+ Pablo Musa 2013-04-03, 20:24
+ Ted Yu 2013-04-03, 21:40
+ Azuryy Yu 2013-03-11, 02:14