Zookeeper, mail # user - Corrupt files across all nodes in a 3 node cluster on shutdown


Corrupt files across all nodes in a 3 node cluster on shutdown
Todd Nine 2011-04-11, 08:16
Hi all,
  I'm running version 3.3.2 with 3 nodes.  The workload on ZK itself is very
small; it's primarily used for leader/follower election and low-volume
inter-node communication (1 or 2 messages per second).  We're hosted on Amazon
EC2 using RAID on EBS.  I've noticed on several occasions that when we shut
down all 3 nodes at once, we eventually reach a state where ZooKeeper cannot
be restarted.  I attach the EBS drives, re-attach the RAID volumes, and mount
the file systems successfully.  However, all 3 nodes fail to start.  I
consistently receive errors such as the one included below.  The only
resolution seems to be to completely remove all the data from the
/mnt/zookeeperdata/version-2 directory.  Any ideas why this is happening?
We're migrating our Quartz implementation over to ZK for job and trigger
coordination in the cluster.  Once that happens, we can't have data loss.
Any help would be greatly appreciated.

Thanks,
Todd
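
Before wiping version-2, it may help to record what is actually in the directory. The sketch below is only an illustration, not something from the original post: it assumes the usual ZooKeeper naming convention in which snapshot and transaction log files are called snapshot.<hex zxid> and log.<hex zxid>, and the function name `inventory` is made up for this example.

```python
import os


def inventory(data_dir):
    """Return (name, kind, zxid, size_in_bytes) for snapshot/log files in data_dir.

    Assumes files are named snapshot.<hex zxid> or log.<hex zxid>;
    anything else in the directory is skipped.
    """
    rows = []
    for name in sorted(os.listdir(data_dir)):
        kind, sep, suffix = name.partition(".")
        if not sep or kind not in ("snapshot", "log"):
            continue
        try:
            zxid = int(suffix, 16)  # the suffix is the zxid in hexadecimal
        except ValueError:
            continue  # e.g. a renamed or backup file
        size = os.path.getsize(os.path.join(data_dir, name))
        rows.append((name, kind, zxid, size))
    return rows


if __name__ == "__main__":
    # dataDir from zoo.cfg in the post, plus the version-2 subdirectory.
    path = "/mnt/zookeeperdata/version-2"
    if os.path.isdir(path):
        for name, kind, zxid, size in inventory(path):
            print(f"{name}: kind={kind} zxid=0x{zxid:x} size={size} bytes")
```

Running this on each node before and after a full shutdown would give a simple record of which files changed size or disappeared.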

Stack Trace

2011-04-11 08:00:25,572 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /mnt/zookeeperdata/version-2 snapdir /mnt/zookeeperdata/version-2
2011-04-11 08:00:25,576 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /mnt/zookeeperdata/version-2/snapshot.200000000
2011-04-11 08:00:25,580 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@208] - Snapshotting: 200000047
2011-04-11 08:00:25,602 - INFO  [LearnerHandler-/10.161.98.28:59536:LearnerHandler@247] - Follower sid: 0 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@21b64e6a
2011-04-11 08:00:25,603 - WARN  [LearnerHandler-/10.161.98.28:59536:LearnerHandler@326] - Sending snapshot last zxid of peer is 0x200000047 zxid of leader is 0x300000000
2011-04-11 08:00:25,606 - WARN  [LearnerHandler-/10.161.98.28:59536:Leader@471] - Commiting zxid 0x300000000 from /10.160.137.155:2888 not first!
2011-04-11 08:00:25,606 - WARN  [LearnerHandler-/10.161.98.28:59536:Leader@473] - First is 0
2011-04-11 08:00:25,606 - INFO  [LearnerHandler-/10.161.98.28:59536:Leader@497] - Have quorum of supporters; starting up and setting last processed zxid: 12884901888
2011-04-11 08:01:19,835 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 352192519805 (n.zxid), 1 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
2011-04-11 08:02:33,688 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
2011-04-11 08:02:33,700 - INFO  [LearnerHandler-/10.160.246.112:40243:LearnerHandler@247] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@bd10a5c
2011-04-11 08:02:33,701 - WARN  [LearnerHandler-/10.160.246.112:40243:LearnerHandler@326] - Sending snapshot last zxid of peer is 0x0  zxid of leader is 0x300000000
2011-04-11 08:02:34,283 - ERROR [LearnerHandler-/10.160.246.112:40243:LearnerHandler@466] - Unexpected exception causing shutdown while sock still open
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:380)
2011-04-11 08:02:34,284 - WARN  [Thread-7:QuorumCnxManager$RecvWorker@702] - Connection broken for id 2, my id = 1, error = java.io.IOException: Channel eof
2011-04-11 08:02:34,285 - WARN  [LearnerHandler-/10.160.246.112:40243:LearnerHandler@479] - ******* GOODBYE /10.160.246.112:40243 ********
2011-04-11 08:02:59,364 - WARN  [Thread-6:QuorumCnxManager$SendWorker@612] - Interrupted while waiting for message on queue
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:1961)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2038)
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:342)
        at org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:601)
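
The log above prints zxids in both hex and decimal, which makes them hard to compare by eye. A zxid is a 64-bit number whose high 32 bits are the leader epoch and whose low 32 bits are a counter within that epoch. The sketch below simply splits the values quoted in the log; the helper name `split_zxid` and the labels are made up for this example.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF


# Values copied from the log in the post.
samples = [
    ("peer sid 0, last zxid", 0x200000047),
    ("leader zxid", 0x300000000),
    ("last processed zxid", 12884901888),
    ("n.zxid from sid 2, round 1", 352192519805),
    ("n.zxid from sid 2, round 2", 0),
]

for label, value in samples:
    epoch, counter = split_zxid(value)
    print(f"{label}: 0x{value:x} -> epoch {epoch}, counter {counter}")
```

Decoded this way, 12884901888 is the same value as 0x300000000 (epoch 3, counter 0), while the first notification from sid 2 carries 352192519805, which is epoch 82 with counter 5201533. The second notification from the same server reports 0.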

zoo.cfg

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/mnt/zookeeperdata
# the port at which the clients will connect
clientPort=2181

#the max number of client connections
maxClientCnxns=50
#peer_port2888leader_port3888ipaddress10.161.98.28
server.0=10.161.98.28:2888:3888
#peer_port2888leader_port3888ipaddress10.160.137.155
server.1=10.160.137.155:2888:3888
#peer_port2888leader_port3888ipaddress10.160.246.112
server.2=10.160.246.112:2888:3888
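
To make the tick-based settings above concrete, here is a small sketch that parses the same zoo.cfg text and converts the limits to milliseconds. It is an illustration under stated assumptions: the function names and the `peer_port`/`election_port` labels are mine, the session timeout bounds assume the defaults of 2 and 20 times tickTime (which matches the 4000/40000 values in the log), and the quorum size assumes plain majority voting.

```python
ZOO_CFG = """\
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/mnt/zookeeperdata
clientPort=2181
maxClientCnxns=50
server.0=10.161.98.28:2888:3888
server.1=10.160.137.155:2888:3888
server.2=10.160.246.112:2888:3888
"""


def parse_zoo_cfg(text):
    """Return (settings, servers) parsed from zoo.cfg-style key=value text."""
    settings, servers = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("server."):
            host, peer_port, election_port = value.split(":")
            sid = int(key[len("server."):])
            servers[sid] = (host, int(peer_port), int(election_port))
        else:
            settings[key] = value
    return settings, servers


def derived(settings, servers):
    """Convert tick-based limits to milliseconds and compute the majority size."""
    tick = int(settings["tickTime"])
    return {
        "init_ms": tick * int(settings["initLimit"]),
        "sync_ms": tick * int(settings["syncLimit"]),
        "min_session_ms": 2 * tick,   # assumed default: 2 * tickTime
        "max_session_ms": 20 * tick,  # assumed default: 20 * tickTime
        "quorum": len(servers) // 2 + 1,
    }


settings, servers = parse_zoo_cfg(ZOO_CFG)
print(derived(settings, servers))
```

With this configuration a follower has 20 seconds (initLimit) to connect and sync with the leader, 10 seconds (syncLimit) between a request and its acknowledgement, and 2 of the 3 servers must be up to form a quorum.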