Re: What to do when a node will not join the cluster?
Brian,

    Take a look at the initLimit and syncLimit configuration options
<http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html>; they may help.
I have had problems like that in a 3-node cluster because of the size and
quantity of the data: after running for some time, the initial sync, and even
the whole system, got messed up. In my case I raised those values and did some
tricks to reduce/compact the data in ZK.
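
To make the effect of those two settings concrete: both are counted in ticks, so a follower gets roughly tickTime * initLimit milliseconds to connect to the leader and finish the initial sync, and tickTime * syncLimit milliseconds of slack afterwards. Here is a rough sketch of that arithmetic. The function name and the numbers are made up for illustration; they are not values from your cluster or mine.

```python
def zk_timeouts_ms(tick_time_ms, init_limit, sync_limit):
    """Convert ZooKeeper's tick-based limits into milliseconds.

    initLimit and syncLimit in zoo.cfg are expressed in ticks,
    so each timeout is the limit multiplied by tickTime.
    """
    return {
        "init_timeout_ms": tick_time_ms * init_limit,
        "sync_timeout_ms": tick_time_ms * sync_limit,
    }


# Hypothetical "before" and "after" settings, for illustration only.
before = zk_timeouts_ms(tick_time_ms=2000, init_limit=10, sync_limit=5)
after = zk_timeouts_ms(tick_time_ms=2000, init_limit=60, sync_limit=10)

print("before:", before)
print("after: ", after)
```

If the snapshot takes longer to transfer and load than the init timeout, raising initLimit (or shrinking the data) is the lever to try first.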

On Mon, Nov 19, 2012 at 3:13 PM, Brian Tarbox <[EMAIL PROTECTED]> wrote:

> I have a four node cluster (I know, it should be odd) that generally runs
> fine but this morning I needed to restart the whole cluster and one of the
> nodes will not sync.  The node asks for a snapshot from the leader..waits
> for several minutes(!) and then fails.
>
> 11:46:55,130 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@294] - Getting a snapshot from leader
> 11:47:01,535 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@325] - Setting leader epoch e
> 11:47:21,707 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@341] - Got zxid 0xe0000000a expected 0x1
> 11:55:01,515 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
> java.io.EOFException
>
> On the Leader side it appears to be sending the snapshot and then it fails.
> I have no idea how to proceed...any suggestion appreciated.
>
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@318] - Synchronizing with Follower sid: 4 maxCommittedLog=0xe00000009 minCommittedLog=0xe00000001 peerLastZxid=0x900323414
> 11:46:55,129 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@379] - Unhandled proposal scenario
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@395] - Sending SNAP
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x900323414  zxid of leader is 0xe00000009sent zxid of db as 0xe00000009
> 11:55:01,513 [myid:5] - ERROR [LearnerHandler-/172.16.10.200:46021:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.readInt(Unknown Source)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
> 11:55:01,513 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@575] - ******* GOODBYE /172.16.10.200:46021 ********
>
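
A note on reading the zxids in the logs above: a zxid is a 64-bit number with the leader epoch in the high 32 bits and a per-epoch counter in the low 32 bits. The small sketch below splits the values from the leader's "Synchronizing with Follower" line and checks whether the follower's last zxid falls inside the leader's committed-log window. The helper names are my own, and the window check is a simplification of what the leader really does, so treat it as an aid for reading the log rather than as ZooKeeper's actual logic.

```python
def split_zxid(zxid):
    """Split a 64-bit zxid into (epoch, counter)."""
    return zxid >> 32, zxid & 0xFFFFFFFF


def within_committed_log(peer_last, min_committed, max_committed):
    """Simplified check: is the follower's last zxid inside the
    range of proposals the leader still holds in memory?"""
    return min_committed <= peer_last <= max_committed


# Values copied from the leader log quoted above.
peer_last_zxid = 0x900323414
min_committed_log = 0xe00000001
max_committed_log = 0xe00000009

print("peer  (epoch, counter):", split_zxid(peer_last_zxid))
print("leader(epoch, counter):", split_zxid(max_committed_log))
print("inside window:", within_committed_log(
    peer_last_zxid, min_committed_log, max_committed_log))
```

Here the follower is still at epoch 9 while the leader is at epoch 0xe, so its last zxid is outside the window. That is consistent with the leader logging "Sending SNAP" and shipping a full snapshot, which is the transfer that has to fit within the init timeout.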

--
Diego de Oliveira
[EMAIL PROTECTED]
www.diegooliveira.com
Never argue with a fool -- people might not be able to tell the difference