Re: Corrupt files across all nodes in a 3 node cluster on shutdown
Hi Todd,

The message you attached below shows that all 3 nodes were able to form
the cluster (hence no DB corruption). However, follower 1 closed its
connection with leader 0, which caused the EOF exception at 1. Without
the logs from the other nodes it is difficult to analyze this
problem. After you remount the FS, is it ready to do I/O (no underlying
fsck or RAID recovery in progress)?
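
When reading the leader's log quoted below, it may help to remember that a
zxid is a 64-bit value with the epoch in the high 32 bits and a per-epoch
counter in the low 32 bits. Here is a minimal Python sketch that decodes the
values appearing in the log. The helper name split_zxid is made up for
illustration and is not part of ZooKeeper.

```python
def split_zxid(zxid):
    """Split a 64-bit ZooKeeper zxid into (epoch, counter).

    Hypothetical helper for reading logs; not a ZooKeeper API.
    """
    epoch = zxid >> 32            # high 32 bits: leader epoch
    counter = zxid & 0xFFFFFFFF   # low 32 bits: transaction counter within the epoch
    return epoch, counter


# Values copied from the quoted log (hex and decimal forms).
for raw in ("0x200000047", "0x300000000", "12884901888"):
    epoch, counter = split_zxid(int(raw, 0))  # base 0 accepts both "0x..." and decimal
    print(f"{raw}: epoch={epoch} counter={counter}")
```

Decoded this way, the 12884901888 in the "setting last processed zxid" line
is the same value as 0x300000000 (epoch 3, counter 0), while the follower's
0x200000047 belongs to epoch 2.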
On Mon, Apr 11, 2011 at 4:16 AM, Todd Nine <[EMAIL PROTECTED]> wrote:

> Hi all,
>  I'm running version 3.3.2 with 3 nodes.  The workload on ZK itself is very
> small; it's primarily used for leader/follower election and low-volume
> inter-node communication (1 or 2 messages per second).  We're hosted on
> Amazon EC2 using a RAID EBS.  I've noticed on several occasions that when we
> shut down all 3 nodes at once, we eventually reach a state where ZooKeeper
> cannot be restarted.  I attach the EBS drives, re-attach the RAID volumes and
> mount the file systems successfully.  However, all 3 nodes fail to start.  I
> consistently receive errors such as the one included below.  The only
> resolution seems to be to completely remove all the data from the
> /mnt/zookeeperdata/version-2 directory.  Any ideas why this is happening?
> We're migrating our Quartz implementation over to ZK for job and trigger
> coordination in the cluster.  Once that happens we can't have data loss.
> Any help would be greatly appreciated.
>
> Thanks,
> Todd
>
> Stack Trace
>
> 2011-04-11 08:00:25,572 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@151] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /mnt/zookeeperdata/version-2 snapdir /mnt/zookeeperdata/version-2
> 2011-04-11 08:00:25,576 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileSnap@82] - Reading snapshot /mnt/zookeeperdata/version-2/snapshot.200000000
> 2011-04-11 08:00:25,580 - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:FileTxnSnapLog@208] - Snapshotting: 200000047
> 2011-04-11 08:00:25,602 - INFO  [LearnerHandler-/10.161.98.28:59536:LearnerHandler@247] - Follower sid: 0 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@21b64e6a
> 2011-04-11 08:00:25,603 - WARN  [LearnerHandler-/10.161.98.28:59536:LearnerHandler@326] - Sending snapshot last zxid of peer is 0x200000047 zxid of leader is 0x300000000
> 2011-04-11 08:00:25,606 - WARN  [LearnerHandler-/10.161.98.28:59536:Leader@471] - Commiting zxid 0x300000000 from /10.160.137.155:2888 not first!
> 2011-04-11 08:00:25,606 - WARN  [LearnerHandler-/10.161.98.28:59536:Leader@473] - First is 0
> 2011-04-11 08:00:25,606 - INFO  [LearnerHandler-/10.161.98.28:59536:Leader@497] - Have quorum of supporters; starting up and setting last processed zxid: 12884901888
> 2011-04-11 08:01:19,835 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 352192519805 (n.zxid), 1 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
> 2011-04-11 08:02:33,688 - INFO  [WorkerReceiver Thread:FastLeaderElection@496] - Notification: 2 (n.leader), 0 (n.zxid), 2 (n.round), LOOKING (n.state), 2 (n.sid), LEADING (my state)
> 2011-04-11 08:02:33,700 - INFO  [LearnerHandler-/10.160.246.112:40243:LearnerHandler@247] - Follower sid: 2 : info : org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer@bd10a5c
> 2011-04-11 08:02:33,701 - WARN  [LearnerHandler-/10.160.246.112:40243:LearnerHandler@326] - Sending snapshot last zxid of peer is 0x0  zxid of leader is 0x300000000
> 2011-04-11 08:02:34,283 - ERROR [LearnerHandler-/10.160.246.112:40243:LearnerHandler@466] - Unexpected exception causing shutdown while sock still open
> java.io.EOFException
>        at java.io.DataInputStream.readInt(DataInputStream.java:375)
>        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:84)
>        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)