Zookeeper user mailing list: What to do when a node will not join the cluster?


Brian Tarbox 2012-11-19, 17:13
Re: What to do when a node will not join the cluster?
Brian,

    Take a look at the configuration options initLimit and
syncLimit <http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html>;
they may help. I have had problems like this in a 3-node cluster because of
the size and quantity of the data: the initial sync, and even the
leader/follower role assignment, was messed up for a while. In my case I
raised those values and did some work to reduce/compact the data in ZK.
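As a rough illustration of what those two settings control: per the ZooKeeper admin guide, both are counted in ticks, so the real windows are initLimit * tickTime milliseconds (time a follower gets to connect and sync with the leader, which the guide suggests raising when the data set is large) and syncLimit * tickTime milliseconds (how far a follower may lag before it is dropped). The Java sketch below just does that arithmetic on a zoo.cfg-style string; the class and method names are hypothetical, and the numbers are illustrative, not Brian's actual configuration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper (not part of ZooKeeper): computes the time windows implied by
// the tickTime, initLimit and syncLimit entries of a zoo.cfg-style configuration.
public class SyncWindows {

    /** Parses zoo.cfg-style "key=value" lines into a Properties object. */
    static Properties parse(String zooCfg) {
        Properties cfg = new Properties();
        try {
            cfg.load(new StringReader(zooCfg));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return cfg;
    }

    /** Reads a required numeric setting, failing loudly if it is absent. */
    private static long required(Properties cfg, String key) {
        String value = cfg.getProperty(key);
        if (value == null) {
            throw new IllegalArgumentException("missing setting: " + key);
        }
        return Long.parseLong(value.trim());
    }

    /** initLimit ticks * tickTime ms: time a follower gets to connect and sync with the leader. */
    static long initWindowMillis(Properties cfg) {
        return required(cfg, "initLimit") * required(cfg, "tickTime");
    }

    /** syncLimit ticks * tickTime ms: how long a follower may lag the leader before being dropped. */
    static long syncWindowMillis(Properties cfg) {
        return required(cfg, "syncLimit") * required(cfg, "tickTime");
    }

    public static void main(String[] args) {
        // Illustrative values only, not taken from the cluster in this thread.
        Properties small = parse("tickTime=2000\ninitLimit=10\nsyncLimit=5\n");
        System.out.println("init window: " + initWindowMillis(small) + " ms"); // 20000 ms
        System.out.println("sync window: " + syncWindowMillis(small) + " ms"); // 10000 ms

        // A larger initLimit leaves more time for a big snapshot to reach the follower.
        Properties raised = parse("tickTime=2000\ninitLimit=300\nsyncLimit=5\n");
        System.out.println("init window: " + initWindowMillis(raised) + " ms"); // 600000 ms
    }
}
```

Because the limits are multiplied by tickTime, raising either initLimit or tickTime lengthens the sync window; changing tickTime also affects other tick-based timeouts, so adjusting initLimit is usually the more targeted knob.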

On Mon, Nov 19, 2012 at 3:13 PM, Brian Tarbox <[EMAIL PROTECTED]> wrote:

> I have a four node cluster (I know, it should be odd) that generally runs
> fine but this morning I needed to restart the whole cluster and one of the
> nodes will not sync.  The node asks for a snapshot from the leader, waits
> for several minutes (!) and then fails.
>
> 11:46:55,130 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@294] - Getting a snapshot from leader
> 11:47:01,535 [myid:] - INFO  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@325] - Setting leader epoch e
> 11:47:21,707 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Learner@341] - Got zxid 0xe0000000a expected 0x1
> 11:55:01,515 [myid:] - WARN  [QuorumPeer:/0:0:0:0:0:0:0:0:2181:Follower@82] - Exception when following the leader
> java.io.EOFException
>
> On the leader side it appears to be sending the snapshot and then it fails.
> I have no idea how to proceed; any suggestions appreciated.
>
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@318] - Synchronizing with Follower sid: 4 maxCommittedLog=0xe00000009 minCommittedLog=0xe00000001 peerLastZxid=0x900323414
> 11:46:55,129 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@379] - Unhandled proposal scenario
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@395] - Sending SNAP
> 11:46:55,129 [myid:5] - INFO  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@419] - Sending snapshot last zxid of peer is 0x900323414  zxid of leader is 0xe00000009sent zxid of db as 0xe00000009
> 11:55:01,513 [myid:5] - ERROR [LearnerHandler-/172.16.10.200:46021:LearnerHandler@562] - Unexpected exception causing shutdown while sock still open
> java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.net.SocketInputStream.read(Unknown Source)
>         at java.io.BufferedInputStream.fill(Unknown Source)
>         at java.io.BufferedInputStream.read(Unknown Source)
>         at java.io.DataInputStream.readInt(Unknown Source)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:450)
> 11:55:01,513 [myid:5] - WARN  [LearnerHandler-/172.16.10.200:46021:LearnerHandler@575] - ******* GOODBYE /172.16.10.200:46021 ********
>
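A side note for reading the leader log quoted above: a ZooKeeper zxid is a 64-bit value whose high 32 bits hold the leader epoch and whose low 32 bits hold a counter within that epoch. Decoding the logged values suggests the follower stopped at epoch 0x9, while the leader's in-memory committed log only covers epoch 0xe. That would explain the "Unhandled proposal scenario" warning followed by "Sending SNAP": the leader cannot replay just the missing proposals, so it falls back to a full snapshot, and a large snapshot is exactly what initLimit has to accommodate. This is an interpretation of the logs, not a confirmed diagnosis; the class below is a hypothetical helper, and only the zxid values are copied from the log.

```java
// Hypothetical helper (not part of ZooKeeper): splits a zxid into its epoch
// (high 32 bits) and counter (low 32 bits) to make the quoted logs easier to read.
public class ZxidDecoder {

    /** Leader epoch stored in the high 32 bits. */
    static long epoch(long zxid) {
        return zxid >>> 32;
    }

    /** Per-epoch transaction counter stored in the low 32 bits. */
    static long counter(long zxid) {
        return zxid & 0xFFFFFFFFL;
    }

    static String describe(long zxid) {
        return String.format("0x%x -> epoch 0x%x, counter 0x%x", zxid, epoch(zxid), counter(zxid));
    }

    public static void main(String[] args) {
        // Values copied from the "Synchronizing with Follower sid: 4" log line.
        long peerLastZxid = 0x900323414L;
        long minCommittedLog = 0xe00000001L;
        long maxCommittedLog = 0xe00000009L;

        System.out.println("follower:   " + describe(peerLastZxid));
        System.out.println("leader min: " + describe(minCommittedLog));
        System.out.println("leader max: " + describe(maxCommittedLog));

        // If the follower's last zxid falls outside the leader's committed-log window,
        // an incremental diff is not possible from that log alone.
        boolean inWindow = peerLastZxid >= minCommittedLog && peerLastZxid <= maxCommittedLog;
        System.out.println("follower inside committed log window: " + inWindow);
    }
}
```

Run as-is, the window check prints false: the follower's zxid (epoch 0x9) is older than the leader's oldest committed entry (epoch 0xe, counter 0x1).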

--
Diego de Oliveira
[EMAIL PROTECTED]
www.diegooliveira.com
Never argue with a fool -- people might not be able to tell the difference