Re: ZOOKEEPER-900 / 901 / 1678
More digging!

See the attached screenshot from loggraph. It shows that server1 thinks it
is a follower, but it can't connect to server3 because server3 doesn't yet
think it is the leader. I believe this is because server3 is blocked for
5 seconds (the connect timeout) while trying to connect to the dead server
(server2). It can't receive any notifications during this period because
the connectOne() method is synchronized. As a result, it has not yet
spawned the Learner thread, which opens the server socket that accepts
connections from followers.
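
To illustrate the blocking I mean, here is a minimal simulation. It is not the ZooKeeper code: the class, handleNotification() and the latch are hypothetical, and the sleep stands in for a connect that blocks until the timeout. It only shows that while one thread sits inside a synchronized method, any other synchronized method on the same object has to wait.

```java
import java.util.concurrent.CountDownLatch;

public class SynchronizedBlockingSketch {
    // Signals that the "connect" thread is inside the synchronized method.
    private final CountDownLatch connecting = new CountDownLatch(1);

    // Hypothetical stand-in for a connect to a dead host: holds the object's
    // monitor for the whole duration of the (simulated) connect timeout.
    synchronized void connectOne(long blockMillis) throws InterruptedException {
        connecting.countDown();
        Thread.sleep(blockMillis);
    }

    // Hypothetical handler; being synchronized on the same object, it cannot
    // run until connectOne() has returned.
    synchronized void handleNotification() {
        // nothing to do, we only care about when we get in
    }

    // Returns how long (ms) handleNotification() had to wait for the monitor,
    // or -1 if the calling thread was interrupted.
    public static long notificationDelayMillis(long blockMillis) {
        SynchronizedBlockingSketch mgr = new SynchronizedBlockingSketch();
        Thread connector = new Thread(() -> {
            try {
                mgr.connectOne(blockMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        connector.start();
        try {
            mgr.connecting.await();          // connector now holds the monitor
            long start = System.nanoTime();
            mgr.handleNotification();        // blocks until the monitor is free
            long waited = (System.nanoTime() - start) / 1_000_000;
            connector.join();
            return waited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
    }

    public static void main(String[] args) {
        System.out.println("waited ms: " + notificationDelayMillis(5000));
    }
}
```

With a 5000 ms block, the reported wait should be close to the full 5 seconds, which is the period during which I think server3 is deaf to everything else.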

server1 tries to connect to server3 5 times (in the connectToLeader()
method in Learner) in relatively quick succession (with 1 second sleeps in
between), and every attempt fails because the server socket is not yet up.
At this point, server1 gives up, closes the Follower, and goes back into
the LOOKING state, which results in another election.
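
The retry behaviour described above can be sketched roughly as follows. This is a simplified model under my own assumptions, not the actual connectToLeader() source: the Connector interface and connectWithRetries() are hypothetical names, and the "leader" is just a lambda that refuses connections for the first 5 seconds.

```java
import java.io.IOException;

public class ConnectRetrySketch {
    /** Hypothetical stand-in for a single connection attempt to the leader. */
    public interface Connector {
        void connect() throws IOException;
    }

    /**
     * Tries to connect up to maxAttempts times, sleeping sleepMillis between
     * failed attempts. Returns the 1-based number of the attempt that
     * succeeded, or -1 if every attempt failed (or the thread was interrupted).
     */
    public static int connectWithRetries(Connector c, int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                c.connect();
                return attempt;
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    break; // out of attempts, give up
                }
                try {
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        final long start = System.currentTimeMillis();
        // Simulated leader whose server socket only comes up after 5 seconds.
        Connector leaderNotReady = () -> {
            if (System.currentTimeMillis() - start < 5000) {
                throw new IOException("Connection refused");
            }
        };
        int result = connectWithRetries(leaderNotReady, 5, 1000);
        System.out.println(result == -1
                ? "gave up, back to LOOKING"
                : "connected on attempt " + result);
    }
}
```

In this model the five attempts land at roughly 0, 1, 2, 3 and 4 seconds, so all of them fall inside the 5 second period in which the simulated leader is not yet listening.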

I can't think of anything that can be done without making the socket
establishment calls non-blocking, which is not an insignificant change.

We can reduce the timeout for connection establishment, which should
greatly reduce the likelihood of the issue. The window of opportunity seems
to occur when the leader is blocked trying to connect to the dead host (5
seconds) while the follower is attempting to connect to the leader. At a
minimum, the follower's attempts to connect to the leader will take 4
seconds plus however long the connection attempts themselves take (the
connectToLeader() method makes 5 attempts to establish a connection to the
leader, with a 1 second sleep in between them). So, given that we cannot
increase the number of attempts to communicate with the leader, or the
sleep period between attempts, the only option left to us is to minimize
the time that the leader can be blocked while attempting to connect to the
dead host. Obviously, reducing this number too much will result in other
issues, so a bit of fine-tuning will be required.
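
To put numbers on that window, here is a small back-of-the-envelope calculation. The class and method names are made up for illustration, and it assumes the leader starts blocking and the follower starts connecting at the same instant, which will not be exactly true in practice.

```java
public class ElectionWindowSketch {
    /**
     * Time (ms) after which the follower has used up all its attempts:
     * (attempts - 1) sleeps plus the duration of each failed attempt.
     */
    public static long followerGiveUpMillis(int attempts, long sleepMillis, long perAttemptMillis) {
        return (attempts - 1) * sleepMillis + attempts * perAttemptMillis;
    }

    /** True if the follower gives up while the leader is still blocked. */
    public static boolean followerGivesUpFirst(long leaderBlockedMillis, int attempts,
                                               long sleepMillis, long perAttemptMillis) {
        return followerGiveUpMillis(attempts, sleepMillis, perAttemptMillis) < leaderBlockedMillis;
    }

    public static void main(String[] args) {
        // Assumption: a refused connection fails almost instantly (0 ms).
        long[] candidateTimeouts = {5000, 4000, 2000, 1000};
        for (long timeout : candidateTimeouts) {
            System.out.println("leader blocked " + timeout + " ms -> follower gives up first: "
                    + followerGivesUpFirst(timeout, 5, 1000, 0));
        }
    }
}
```

Under these assumptions the follower gives up after 4 seconds, so only a leader-side block longer than that (such as the current 5 seconds) leaves the window open. Real timings will shift this by however long each failed attempt takes.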

Any other suggestions? I'm still hoping that I'm missing something simple!
cheers
Cam

On Thu, May 1, 2014 at 8:48 AM, Cameron McKenzie <[EMAIL PROTECTED]> wrote: