
ZooKeeper user mailing list


What to do when a node will not join the cluster?
I have a four-node cluster (I know, it should have an odd number of nodes) that
generally runs fine, but this morning I needed to restart the whole cluster and
one of the nodes will not sync. The node asks the leader for a snapshot, waits
for several minutes(!) (about eight, from 11:46:55 to 11:55:01 in the logs
below) and then fails.

11:46:55,130 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@294] - Getting a snapshot from leader
11:47:01,535 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@325] - Setting leader epoch e
11:47:21,707 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@341] - Got zxid 0xe0000000a expected 0x1
11:55:01,515 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
java.io.EOFException
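
To read the zxids in these logs, it may help to split them into their two halves. ZooKeeper zxids are 64-bit values whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch. The Java sketch below assumes that layout; the class `ZxidDecoder` and its methods are hypothetical helpers written for illustration, not part of ZooKeeper's API. Applied to the logged values, it shows the leader working in epoch 0xe while this follower's last zxid, 0x900323414, dates from epoch 0x9.

```java
// Hypothetical helper (not ZooKeeper API): splits a zxid into epoch and counter,
// assuming the standard layout of high 32 bits = epoch, low 32 bits = counter.
public final class ZxidDecoder {
    private ZxidDecoder() {}

    /** Leader epoch, stored in the high 32 bits. */
    public static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Per-epoch transaction counter, stored in the low 32 bits. */
    public static long counter(long zxid) {
        return zxid & 0xffffffffL;
    }

    /** Formats a zxid as "epoch=0x.., counter=0x..". */
    public static String describe(long zxid) {
        return "epoch=0x" + Long.toHexString(epoch(zxid))
                + ", counter=0x" + Long.toHexString(counter(zxid));
    }

    public static void main(String[] args) {
        // Values copied from the follower and leader logs in this thread.
        long[] zxids = {0xe0000000aL, 0xe00000009L, 0xe00000001L, 0x900323414L};
        for (long z : zxids) {
            System.out.println("0x" + Long.toHexString(z) + " -> " + describe(z));
        }
    }
}
```

Running `main` prints one line per zxid, for example `0x900323414 -> epoch=0x9, counter=0x323414`.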

On the leader side, it appears to send the snapshot and then time out waiting
to read from the follower. I have no idea how to proceed... any suggestions appreciated.

11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@318] - Synchronizing with Follower sid: 4 maxCommittedLog=0xe00000009 minCommittedLog=0xe00000001 peerLastZxid=0x900323414
11:46:55,129 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@379] - Unhandled proposal scenario
11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@395] - Sending SNAP
11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x900323414  zxid of leader is 0xe00000009sent zxid of db as 0xe00000009
11:55:01,513 [myid:5] - ERROR [LearnerHandler-/172.16.10.200:46021:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.net.SocketInputStream.read(Unknown Source)
        at java.io.BufferedInputStream.fill(Unknown Source)
        at java.io.BufferedInputStream.read(Unknown Source)
        at java.io.DataInputStream.readInt(Unknown Source)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
11:55:01,513 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@575] - ******* GOODBYE /172.16.10.200:46021 ********
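
The leader's first log line suggests why a full snapshot was chosen: the follower's peerLastZxid (0x900323414) is older than minCommittedLog (0xe00000001), so the leader cannot catch it up by replaying only the missing transactions. The Java sketch below is a simplified, hypothetical model of that decision. The class, enum, and method names are invented for illustration, and it is not the actual `LearnerHandler` code, which handles more cases; it only assumes that a follower inside the committed-log window gets a diff, one ahead of the window is truncated, and one behind it gets a snapshot.

```java
// Hypothetical, simplified model of a leader choosing how to sync a follower.
// NOT ZooKeeper's LearnerHandler implementation; it only captures the idea that a
// follower older than the leader's committed-log window needs a full snapshot.
public final class SyncModeSketch {
    public enum SyncMode { DIFF, TRUNC, SNAP }

    private SyncModeSketch() {}

    public static SyncMode choose(long peerLastZxid, long minCommittedLog, long maxCommittedLog) {
        if (peerLastZxid > maxCommittedLog) {
            // Follower claims transactions the leader never committed: truncate them.
            return SyncMode.TRUNC;
        }
        if (peerLastZxid >= minCommittedLog) {
            // Follower's position is inside the committed log: replay the gap.
            return SyncMode.DIFF;
        }
        // Follower predates everything kept in the committed log: send a snapshot.
        return SyncMode.SNAP;
    }

    public static void main(String[] args) {
        // Values from the leader log: peerLastZxid=0x900323414,
        // minCommittedLog=0xe00000001, maxCommittedLog=0xe00000009.
        System.out.println(choose(0x900323414L, 0xe00000001L, 0xe00000009L)); // SNAP
    }
}
```

With the logged values, `choose` returns `SNAP`, which is consistent with the leader warning "Unhandled proposal scenario" and then logging "Sending SNAP".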