We have deployed zookeeper version 220.127.116.115976, with 3 zk servers in the quorum. The problem we are facing is that one zookeeper server drops out of the quorum and never becomes part of the cluster again until we restart the zookeeper server on that node.
Our interpretation of the zookeeper logs on all nodes is as follows (for simplicity assume S1 => zk server 1, S2 => zk server 2, S3 => zk server 3). Initially S3 is the leader while S1 and S2 are followers.
S2 hits a 46-second latency while fsyncing the write-ahead log, which results in loss of connection with S3. S3 in turn prints the following error message:
Unexpected exception causing shutdown while sock still open
java.net.SocketTimeoutException: Read timed out
Stack trace
******* GOODBYE /169.254.1.2:47647(S2) ********
S2 in this case closes its connection with S3 (the leader) and shuts down the follower, with the following log messages:
Closing connection to leader, exception during packet send
java.net.SocketException: Socket close
Follower@194] - shutdown called
java.lang.Exception: shutdown Follower
After this point S3 can never reestablish a connection with S2, and the leader election mechanism keeps failing. S3 now keeps printing the following message repeatedly:
Cannot open channel to 2 at election address /169.254.1.2:3888
java.net.ConnectException: Connection refused
While S3 is in this state, S2 repeatedly prints the following message:
INFO [NIOServerCxnFactory.AcceptThread:/0.0.0.0:2181 127.0.0.1:60667
Exception causing close of session 0x0: ZooKeeperServer not running
Closed socket connection for client /127.0.0.1:60667 (no session established for client)
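That last message is ZooKeeper's standard rejection when the process is up but the server is not serving requests. One way to check for this state directly is to send the four-letter "stat" command to the client port; below is a minimal sketch, assuming it is run on S2 itself against the clientPort 2181 from the config further down (a server that is not part of a quorum typically answers that it is not currently serving requests):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Sends the "stat" four-letter command to the client port and prints the reply.
public class StatProbe {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("127.0.0.1", 2181)) {   // S2's clientPort (assumed local)
            OutputStream out = socket.getOutputStream();
            out.write("stat".getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                System.out.print(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
        }
    }
}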
Leader election never completes successfully, causing S2 to drop out of the quorum. S2 was out of the quorum for almost a week.
While debugging this issue, we found out that both the election and peer connection ports on S2 can't be telneted to from any of the nodes (S1, S2, S3). Network connectivity is not the issue. Later, we restarted the ZK server on S2 (service zookeeper-server restart) -- now we could telnet to both ports, and S2 joined the ensemble after a leader election attempt. Any idea what might be forcing S2 into a situation where it won't accept any connections on the leader election and peer connection ports?
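For reference, the telnet check can be reproduced programmatically with a plain TCP probe; in the sketch below the 169.254.1.2 address and election port 3888 are taken from the logs above, while 2888 is assumed to be the default peer port:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Plain TCP reachability probe, equivalent to the manual telnet check against S2.
public class PortProbe {
    public static void main(String[] args) {
        String host = "169.254.1.2";   // S2 in this thread
        int[] ports = {2888, 3888};    // assumed peer port, and the election port from the logs
        for (int port : ports) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 3000);
                System.out.println(host + ":" + port + " accepted the connection");
            } catch (IOException e) {
                System.out.println(host + ":" + port + " failed: " + e.getMessage());
            }
        }
    }
}

A connection failure on both ports, from S2 itself as well as from S1 and S3, would correspond to the telnet behaviour described above.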
Should I file a jira on this and upload all the log files while submitting it, as the log files are close to 250MB each?
maxClientCnxns=50
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/var/lib/zookeeper
# the port at which the clients will connect
clientPort=2181
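The server.N lines defining the ensemble are not included in the snippet above; for a 3-server quorum they would typically look like the following, where the hostnames are placeholders and 2888/3888 are the default peer and election ports (3888 matches the election address seen in the logs):

server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

With tickTime=2000 and syncLimit=5, a follower is only allowed roughly syncLimit x tickTime = 5 x 2000 ms = 10 s of silence before the leader drops its socket, so a 46-second fsync pause readily explains the initial disconnect; the open question is why S2 then stopped listening on its peer and election ports.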
Sorry for the slow response. I can't figure out what might be going on here without the log files. The traces you see on S2 do not indicate any problem, as far as I can see. It seems that you have a client running on S2 that tries to connect to that server. Since S2 hasn't been able to join a quorum, the part of the server that attends to clients hasn't been started and the connection is rejected. Maybe you could start by uploading the traces around the connection loss between S2 and S3 (say a couple of minutes before and after).
German. On Thu, Jan 23, 2014 at 8:42 PM, Deepak Jagtap <[EMAIL PROTECTED]> wrote:
Thanks for the followup! I have log files for all the servers and they are quite big (greater than 25MB), hence I could not send them through mail. Is it OK if I file a bug on this and upload the logs there?
Thanks & Regards, Deepak
On Sun, Jan 26, 2014 at 1:53 AM, German Blanco <[EMAIL PROTECTED]> wrote:
I don't see why it would be a problem for anybody. If this turns out not to be a problem in ZooKeeper, we can always close the bug case. On Mon, Jan 27, 2014 at 8:33 PM, Deepak Jagtap <[EMAIL PROTECTED]> wrote:
I went through the zookeeper logs again and it looks like a zookeeper bug to me. Leader election was initiated and it never completed, as one zookeeper server went into a zombie (hung) state. Please note that zookeeper was running on all the nodes when this happened.
Thanks & Regards, Deepak On Mon, Jan 27, 2014 at 1:12 PM, Deepak Jagtap <[EMAIL PROTECTED]>wrote:
OK, that might be. I added a comment in the JIRA case that you created (ZOOKEEPER-1869, for others to know the reference) stating that at some point the logs say "leaving the listener" for the election in server 2, and it is not clear whether the server restarts the listener from there. I think it is better to continue the discussion in the JIRA case and leave this thread here. On Tue, Jan 28, 2014 at 9:44 PM, Deepak Jagtap <[EMAIL PROTECTED]> wrote: