|
Jun Rao
2011-08-23, 21:58
Ted Dunning
2011-08-23, 23:07
Jun Rao
2011-08-23, 23:24
Mahadev Konar
2011-08-24, 03:42
Patrick Hunt
2011-08-25, 16:34
Jun Rao
2011-08-29, 04:34
Jun Rao
2011-08-29, 16:39
Fournier, Camille F.
2011-08-29, 17:50
Mahadev Konar
2011-08-29, 18:10
Fournier, Camille F.
2011-08-29, 19:38
Mahadev Konar
2011-08-29, 19:45
Benjamin Reed
2011-08-30, 23:45
Camille Fournier
2011-08-30, 23:50
Jun Rao
2011-08-31, 01:48
kishore g
2011-08-31, 17:02
Fournier, Camille F.
2011-09-01, 14:06
|
-
zk keeps disconnecting and reconnectingJun Rao 2011-08-23, 21:58
I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few ZK
clients (version 3.3.0). After the clients have been running for a while, each of them starts to constantly disconnect and reconnect to the ZK server. On the client, I saw lots of entries like the following: 2011/08/23 14:42:06.579 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/08/23 14:42:06.579 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/08/23 14:42:06.581 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment complete on server esv4-app28.stg/172.18.98.101:12913, sessionid 0x331f77a1ed80004, negotiated timeout = 6000 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) 2011/08/23 14:42:06.583 WARN [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing socket connection and attempting reconnect java.lang.StringIndexOutOfBoundsException: String index out of range: -3 at java.lang.String.substring(String.java:1937) at java.lang.String.substring(String.java:1904) at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 2011/08/23 14:42:07.510 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/08/23 14:42:07.511 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/08/23 14:42:07.512 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f 77a1ed80004, negotiated timeout = 6000 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) 2011/08/23 14:42:07.552 WARN [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos ing socket connection and attempting reconnect java.lang.StringIndexOutOfBoundsException: String index out of range: -3 at java.lang.String.substring(String.java:1937) at java.lang.String.substring(String.java:1904) at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) On the ZK server, I saw lots of entries like these: 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket connection from /172.18.98.95:345072011-08-23 14:34:28,803 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client attempting to renew session 0x231db9b5f3f010e at /172.18.98.95:34507 2011-08-23 14:34:28,803 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:Learner@103] - Revalidating client: 1581489222990563982011-08-23 14:34:28,803 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:12913:NIOServerCnxn@1580] - Established session 0x231db9b5f3f010e with negotiated timeout 6000 for client /172.18.98.95:345 072011-08-23 14:34:28,805 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x231db9b5f3 f010e, likely client has closed socket 2011-08-23 14:34:28,805 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.18.98.95:34507 which had sessionid 0x231db9b5 f3f010e 2011-08-23 14:34:29,512 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket connection from /172.18.98.94:588812011-08-23 14:34:29,512 - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client attempting to renew session 0x231db9b5f3f010c at /172.18.98.94:58881 2011-08-23 14:34:29,512 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:Learner@103] - Revalidating client: 1581489222990563962011-08-23 14:34:29,513 - INFO [QuorumPeer:/0:0:0:0:0:0:0:0:12913:NIOServerCnxn@1580] - Established session 0x231db9b5f3f010c with negotiated timeout 6000 for client /172.18.98.94:588 812011-08-23 14:34:29,514 - WARN [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@634] - EndOfStreamException: Unable to read additional data from client sessionid 0x231db9b5f3 f010c, likely client has closed socket 2011-08-23 14:34:29,514 - INFO [NIOServerCxn.Factory: 0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435] - Closed socket connection for client /172.18.98.94:58881 which had sessionid 0x231db9b5 f3f010c Anybody knows what could be the problem? A few other things that may help understand the problem: each ZK client is also connecting to another cluster of ZK servers in the same JVM and all the ZK connections use the ZK namespace (chroot). Thanks, Jun +
Jun Rao 2011-08-23, 21:58
-
Re: zk keeps disconnecting and reconnectingTed Dunning 2011-08-23, 23:07
Jun,
I don't see what your connection loss problem is, but I do have a question about why you are using 4 ZK servers. This doesn't increase availability or durability over having just three ZK servers. On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few ZK > clients (version 3.3.0). After the clients have been running for a while, > each of them starts to constantly disconnect and reconnect to the ZK > server. > On the client, I saw lots of entries like the following: > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating session > 2011/08/23 14:42:06.581 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment > complete on server esv4-app28.stg/172.18.98.101:12913, sessionid > 0x331f77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:06.583 WARN [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing > socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (Disconnected) > 2011/08/23 14:42:07.510 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection > to > server esv4-app29.stg/172.18.98.89:12913 > 2011/08/23 14:42:07.511 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > established to esv4-app29.stg/172.18.98.89:12913, initiating session > 2011/08/23 14:42:07.512 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment > complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f > 77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:07.552 WARN [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos > ing socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (Disconnected) > > On the ZK server, I saw lots of entries like these: > 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket > connection from /172.18.98.95:345072011-08-23 14:34:28,803 - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client > attempting to renew session 0x231db9b5f3f010e at /172.18.98.95:34507 > 2011-08-23 14:34:28,803 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:12913:Learner@103] - Revalidating client: +
Ted Dunning 2011-08-23, 23:07
-
Re: zk keeps disconnecting and reconnectingJun Rao 2011-08-23, 23:24
Ted,
Yes, we could have used just 3 servers. It just happens that we are overlapping ZK servers on each of the 4 Kafka servers. Jun On Tue, Aug 23, 2011 at 4:07 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > Jun, > > I don't see what your connection loss problem is, but I do have a question > about why you are using 4 ZK servers. This doesn't increase availability > or > durability over having just three ZK servers. > > On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > > > I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few > ZK > > clients (version 3.3.0). After the clients have been running for a while, > > each of them starts to constantly disconnect and reconnect to the ZK > > server. > > On the client, I saw lots of entries like the following: > > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > > to > > server esv4-app28.stg/172.18.98.101:12913 > > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > > established to esv4-app28.stg/172.18.98.101:12913, initiating session > > 2011/08/23 14:42:06.581 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment > > complete on server esv4-app28.stg/172.18.98.101:12913, sessionid > > 0x331f77a1ed80004, negotiated timeout = 6000 > > 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (SyncConnected) > > 2011/08/23 14:42:06.583 WARN [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 > > for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing > > socket connection and attempting reconnect > > java.lang.StringIndexOutOfBoundsException: String index out of range: > > -3 at java.lang.String.substring(String.java:1937) > > at java.lang.String.substring(String.java:1904) > > at > > > > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > > at > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > > 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (Disconnected) > > 2011/08/23 14:42:07.510 INFO [ClientCnxn] > > [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection > > to > > server esv4-app29.stg/172.18.98.89:12913 > > 2011/08/23 14:42:07.511 INFO [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > > established to esv4-app29.stg/172.18.98.89:12913, initiating session > > 2011/08/23 14:42:07.512 INFO [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment > > complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f > > 77a1ed80004, negotiated timeout = 6000 > > 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (SyncConnected) > > 2011/08/23 14:42:07.552 WARN [ClientCnxn] > > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 > > for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos > > ing socket connection and attempting reconnect > > java.lang.StringIndexOutOfBoundsException: String index out of range: > > -3 at java.lang.String.substring(String.java:1937) > > at java.lang.String.substring(String.java:1904) > > at > > > > > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > > at > > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > > 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] > > zookeeper > > state changed (Disconnected) > > > > On the ZK server, I saw lots of entries like these: > > 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: +
Jun Rao 2011-08-23, 23:24
-
Re: zk keeps disconnecting and reconnectingMahadev Konar 2011-08-24, 03:42
This looks very strange to me. Is it possible that there is version
issue with ZK server and client? Mostly looks like the client is getting garbled data from the server somehow. I cant imagine how that would happen. Is it possible to run one of the servers on tracemode and see the tracelog on whats going on? mahadev On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few ZK > clients (version 3.3.0). After the clients have been running for a while, > each of them starts to constantly disconnect and reconnect to the ZK server. > On the client, I saw lots of entries like the following: > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating session > 2011/08/23 14:42:06.581 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment > complete on server esv4-app28.stg/172.18.98.101:12913, sessionid > 0x331f77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:06.583 WARN [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing > socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (Disconnected) > 2011/08/23 14:42:07.510 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to > server esv4-app29.stg/172.18.98.89:12913 > 2011/08/23 14:42:07.511 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > established to esv4-app29.stg/172.18.98.89:12913, initiating session > 2011/08/23 14:42:07.512 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment > complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f > 77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:07.552 WARN [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos > ing socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (Disconnected) > > On the ZK server, I saw lots of entries like these: > 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket > connection from /172.18.98.95:345072011-08-23 14:34:28,803 - INFO > [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770] - Client > attempting to renew session 0x231db9b5f3f010e at /172.18.98.95:34507 +
Mahadev Konar 2011-08-24, 03:42
-
Re: zk keeps disconnecting and reconnectingPatrick Hunt 2011-08-25, 16:34
The client seeing the problem in this case is 3.3.0, I see this based
on the line number in the stack trace not matching up with 3.3.3, with 3.3.0 it's this line: event.setPath(serverPath.substring(chrootPath.length())); so for some reason your chroot path is negative in length? That's just not possible (string.length() should never return negative). What JVM are you using? What's your client connect string look like? Patrick On Tue, Aug 23, 2011 at 2:58 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > I have a ZK server cluster running on 4 nodes (version 3.3.3) and a few ZK > clients (version 3.3.0). After the clients have been running for a while, > each of them starts to constantly disconnect and reconnect to the ZK server. > On the client, I saw lots of entries like the following: > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/23 14:42:06.579 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating session > 2011/08/23 14:42:06.581 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session establishment > complete on server esv4-app28.stg/172.18.98.101:12913, sessionid > 0x331f77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:06.581 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:06.583 WARN [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app28.stg/172.18.98.101:12913, unexpected error, closing > socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:06.683 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (Disconnected) > 2011/08/23 14:42:07.510 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to > server esv4-app29.stg/172.18.98.89:12913 > 2011/08/23 14:42:07.511 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > established to esv4-app29.stg/172.18.98.89:12913, initiating session > 2011/08/23 14:42:07.512 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment > complete on server esv4-app29.stg/172.18.98.89:12913, sessionid = 0x331f > 77a1ed80004, negotiated timeout = 6000 > 2011/08/23 14:42:07.513 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (SyncConnected) > 2011/08/23 14:42:07.552 WARN [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x331f77a1ed80004 > for server esv4-app29.stg/172.18.98.89:12913, unexpected error, clos > ing socket connection and attempting reconnect > java.lang.StringIndexOutOfBoundsException: String index out of range: > -3 at java.lang.String.substring(String.java:1937) > at java.lang.String.substring(String.java:1904) > at > org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:753) > at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:840) > at > org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1089) > 2011/08/23 14:42:07.653 INFO [ZkClient] [main-EventThread] [kafka] zookeeper > state changed (Disconnected) > > On the ZK server, I saw lots of entries like these: > 2011-08-23 14:34:28,802 - INFO [NIOServerCxn.Factory: > 0.0.0.0/0.0.0.0:12913:NIOServerCnxn$Factory@251] - Accepted socket > connection from /172.18.98.95:345072011-08-23 14:34:28,803 - INFO +
Patrick Hunt 2011-08-25, 16:34
-
Re: zk keeps disconnecting and reconnectingJun Rao 2011-08-29, 04:34
We cleaned up all ZK server data and restarted both the servers and the
clients. We also upgraded the client to 3.3.3. After running for a day and a half, the same weird reconnect issue showed up in one of the clients. Our ZK connection string is esv4-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking. We are on java 1_6_0_21 on RedHat Linux. Note that our ZK client has been running fine until we upgraded the client code recently. The new version makes one extra ZK connection to the same ZK cluster. Here are the log entries and ZK client keeps connecting and disconnecting from each of the 4 ZK servers. 2011/08/26 10:58:39.864 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app27.stg/172.18.98.88:12913 2011/08/26 10:58:39.865 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Socket connection established to esv4-app27.stg/172.18.98.88:12913, initiating session 2011/08/26 10:58:39.867 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Session establishment complete on server esv4-app27.stg/172.18.98.88:12913, sessionid 0x131fddd84bc0006, negotiated timeout = 6000 2011/08/26 10:58:39.867 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) 2011/08/26 10:58:39.868 WARN [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Session 0x131fddd84bc0006 for server esv4-app27.stg/172.18.98.88:12913, unexpected error, closing socket connection and attempting reconnect java.lang.StringIndexOutOfBoundsException: String index out of range: -3 at java.lang.String.substring(String.java:1937) at java.lang.String.substring(String.java:1904) at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:794) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:881) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130) 2011/08/26 10:58:39.969 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 2011/08/26 10:58:40.276 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/08/26 10:58:40.276 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/08/26 10:58:40.278 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Session establishment complete on server esv4-app29.stg/172.18.98.89:12913, sessionid 0x131fddd84bc0006, negotiated timeout = 6000 2011/08/26 10:58:40.278 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) 2011/08/26 10:58:40.279 WARN [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Session 0x131fddd84bc0006 for server esv4-app29.stg/172.18.98.89:12913, unexpected error, closing socket connection and attempting reconnect java.lang.StringIndexOutOfBoundsException: String index out of range: -3 at java.lang.String.substring(String.java:1937) at java.lang.String.substring(String.java:1904) at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:794) at org.apache.zookeeper.ClientCnxn$SendThread.doIO(ClientCnxn.java:881) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1130) 2011/08/26 10:58:40.380 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 2011/08/26 10:58:40.515 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 Thanks, Jun On Thu, Aug 25, 2011 at 9:34 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > The client seeing the problem in this case is 3.3.0, I see this based > on the line number in the stack trace not matching up with 3.3.3, with > 3.3.0 it's this line: > > event.setPath(serverPath.substring(chrootPath.length())); +
Jun Rao 2011-08-29, 04:34
-
Re: zk keeps disconnecting and reconnectingJun Rao 2011-08-29, 16:39
What's the impact of ZOOKEEPER-961? If it shows up, does that mean the
client won't get any watcher events afterwards? If so, this sounds like a blocker for 3.4 release to me. What's the temporary solution for 3.3.3? Also, for the very first time that the ZK client gets disconnected, I saw the following entry in the log. It seems that the client can't ping the server for 4 seconds. The ZK server was up at that time and the load was minimal. What could cause the time out? Client GC pauses? 2011/08/26 10:58:22.306 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 4001ms for sessionid 0x131f ddd84bc0006, closing socket connection and attempting reconnect Thanks, Jun On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: > Fournier, Camille F.: > > Did anyone ever check resetting watches at client reconnect on a client > > with a chroot? Looking at the code, we store the watches associated with > > the non-chroot path, but they are set by the original request prepending > > chroot to the request. However, it looks like the SetWatches request on > > reconnect just calls get on the various watch lists from ZooKeeper, which > > don't have the prepended chroot. > > > > I haven't written a test but I would bet dollars to donuts this is the > > problem. > > > > C > seems to be this: > ZOOKEEPER-961, ZOOKEEPER-1091 > > Regards, > > Thomas Koch, http://www.koch.ro > +
Jun Rao 2011-08-29, 16:39
-
RE: zk keeps disconnecting and reconnectingFournier, Camille F. 2011-08-29, 17:50
Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception.
The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. C -----Original Message----- From: Jun Rao [mailto:[EMAIL PROTECTED]] Sent: Monday, August 29, 2011 12:39 PM To: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: zk keeps disconnecting and reconnecting What's the impact of ZOOKEEPER-961? If it shows up, does that mean the client won't get any watcher events afterwards? If so, this sounds like a blocker for 3.4 release to me. What's the temporary solution for 3.3.3? Also, for the very first time that the ZK client gets disconnected, I saw the following entry in the log. It seems that the client can't ping the server for 4 seconds. The ZK server was up at that time and the load was minimal. What could cause the time out? Client GC pauses? 2011/08/26 10:58:22.306 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 4001ms for sessionid 0x131f ddd84bc0006, closing socket connection and attempting reconnect Thanks, Jun On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: > Fournier, Camille F.: > > Did anyone ever check resetting watches at client reconnect on a client > > with a chroot? Looking at the code, we store the watches associated with > > the non-chroot path, but they are set by the original request prepending > > chroot to the request. However, it looks like the SetWatches request on > > reconnect just calls get on the various watch lists from ZooKeeper, which > > don't have the prepended chroot. > > > > I haven't written a test but I would bet dollars to donuts this is the > > problem. > > > > C > seems to be this: > ZOOKEEPER-961, ZOOKEEPER-1091 > > Regards, > > Thomas Koch, http://www.koch.ro > +
Fournier, Camille F. 2011-08-29, 17:50
-
Re: zk keeps disconnecting and reconnectingMahadev Konar 2011-08-29, 18:10
Camille,
Do you think we should put the fix in 3.3.4? I think 3.4 might take a while to stabilize, so 3.3.4 would be a good release to get this in. Thoughts? mahadev On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote: > Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception. > > The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. > > > C > > -----Original Message----- > From: Jun Rao [mailto:[EMAIL PROTECTED]] > Sent: Monday, August 29, 2011 12:39 PM > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > Subject: Re: zk keeps disconnecting and reconnecting > > What's the impact of ZOOKEEPER-961? If it shows up, does that mean the > client won't get any watcher events afterwards? If so, this sounds like a > blocker for 3.4 release to me. What's the temporary solution for 3.3.3? > > Also, for the very first time that the ZK client gets disconnected, I saw > the following entry in the log. It seems that the client can't ping the > server for 4 seconds. The ZK server was up at that time and the load was > minimal. What could cause the time out? Client GC pauses? > > 2011/08/26 10:58:22.306 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, > have not heard from server in 4001ms for sessionid 0x131f > ddd84bc0006, closing socket connection and attempting reconnect > > Thanks, > > Jun > > On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: > >> Fournier, Camille F.: >>> Did anyone ever check resetting watches at client reconnect on a client >>> with a chroot? Looking at the code, we store the watches associated with >>> the non-chroot path, but they are set by the original request prepending >>> chroot to the request. However, it looks like the SetWatches request on >>> reconnect just calls get on the various watch lists from ZooKeeper, which >>> don't have the prepended chroot. >>> >>> I haven't written a test but I would bet dollars to donuts this is the >>> problem. >>> >>> C >> seems to be this: >> ZOOKEEPER-961, ZOOKEEPER-1091 >> >> Regards, >> >> Thomas Koch, http://www.koch.ro >> +
Mahadev Konar 2011-08-29, 18:10
-
RE: zk keeps disconnecting and reconnectingFournier, Camille F. 2011-08-29, 19:38
Yeah let's put it in 3.3.4. What's the plan for 3.4? I thought we were almost ready for that.
C -----Original Message----- From: Mahadev Konar [mailto:[EMAIL PROTECTED]] Sent: Monday, August 29, 2011 2:10 PM To: [EMAIL PROTECTED] Subject: Re: zk keeps disconnecting and reconnecting Camille, Do you think we should put the fix in 3.3.4? I think 3.4 might take a while to stabilize, so 3.3.4 would be a good release to get this in. Thoughts? mahadev On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote: > Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception. > > The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. > > > C > > -----Original Message----- > From: Jun Rao [mailto:[EMAIL PROTECTED]] > Sent: Monday, August 29, 2011 12:39 PM > To: [EMAIL PROTECTED]; [EMAIL PROTECTED] > Subject: Re: zk keeps disconnecting and reconnecting > > What's the impact of ZOOKEEPER-961? If it shows up, does that mean the > client won't get any watcher events afterwards? If so, this sounds like a > blocker for 3.4 release to me. What's the temporary solution for 3.3.3? > > Also, for the very first time that the ZK client gets disconnected, I saw > the following entry in the log. It seems that the client can't ping the > server for 4 seconds. The ZK server was up at that time and the load was > minimal. What could cause the time out? Client GC pauses? > > 2011/08/26 10:58:22.306 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, > have not heard from server in 4001ms for sessionid 0x131f > ddd84bc0006, closing socket connection and attempting reconnect > > Thanks, > > Jun > > On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: > >> Fournier, Camille F.: >>> Did anyone ever check resetting watches at client reconnect on a client >>> with a chroot? Looking at the code, we store the watches associated with >>> the non-chroot path, but they are set by the original request prepending >>> chroot to the request. However, it looks like the SetWatches request on >>> reconnect just calls get on the various watch lists from ZooKeeper, which >>> don't have the prepended chroot. >>> >>> I haven't written a test but I would bet dollars to donuts this is the >>> problem. >>> >>> C >> seems to be this: >> ZOOKEEPER-961, ZOOKEEPER-1091 >> >> Regards, >> >> Thomas Koch, http://www.koch.ro >> +
Fournier, Camille F. 2011-08-29, 19:38
-
Re: zk keeps disconnecting and reconnectingMahadev Konar 2011-08-29, 19:45
Camille,
I will be cutting a branch this week some time. Just waiting for ZOOKEEPER-999 to get in. Other than that, we are probably 2 weeks away from the release. 3.3.4 would be good even if we have 3.4 coming in a week or 2. Thats because 3.4.0 might take sometime to stabilize and 3.3.4 would be a good stable release (recommended for production use), until 3.4 stabilizes. Does that sound reasonable? Others? thanks mahadev On Aug 29, 2011, at 12:38 PM, Fournier, Camille F. wrote: > Yeah let's put it in 3.3.4. What's the plan for 3.4? I thought we were almost ready for that. > > C > > -----Original Message----- > From: Mahadev Konar [mailto:[EMAIL PROTECTED]] > Sent: Monday, August 29, 2011 2:10 PM > To: [EMAIL PROTECTED] > Subject: Re: zk keeps disconnecting and reconnecting > > Camille, > Do you think we should put the fix in 3.3.4? I think 3.4 might take a while to stabilize, so 3.3.4 would be a good release to get this in. > > Thoughts? > > mahadev > > On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote: > >> Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception. >> >> The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. >> >> >> C >> >> -----Original Message----- >> From: Jun Rao [mailto:[EMAIL PROTECTED]] >> Sent: Monday, August 29, 2011 12:39 PM >> To: [EMAIL PROTECTED]; [EMAIL PROTECTED] >> Subject: Re: zk keeps disconnecting and reconnecting >> >> What's the impact of ZOOKEEPER-961? If it shows up, does that mean the >> client won't get any watcher events afterwards? If so, this sounds like a >> blocker for 3.4 release to me. What's the temporary solution for 3.3.3? >> >> Also, for the very first time that the ZK client gets disconnected, I saw >> the following entry in the log. It seems that the client can't ping the >> server for 4 seconds. The ZK server was up at that time and the load was >> minimal. What could cause the time out? Client GC pauses? >> >> 2011/08/26 10:58:22.306 INFO [ClientCnxn] >> [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, >> have not heard from server in 4001ms for sessionid 0x131f >> ddd84bc0006, closing socket connection and attempting reconnect >> >> Thanks, >> >> Jun >> >> On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: >> >>> Fournier, Camille F.: >>>> Did anyone ever check resetting watches at client reconnect on a client >>>> with a chroot? Looking at the code, we store the watches associated with >>>> the non-chroot path, but they are set by the original request prepending >>>> chroot to the request. However, it looks like the SetWatches request on >>>> reconnect just calls get on the various watch lists from ZooKeeper, which >>>> don't have the prepended chroot. >>>> >>>> I haven't written a test but I would bet dollars to donuts this is the >>>> problem. >>>> >>>> C >>> seems to be this: >>> ZOOKEEPER-961, ZOOKEEPER-1091 >>> >>> Regards, >>> >>> Thomas Koch, http://www.koch.ro >>> > +
Mahadev Konar 2011-08-29, 19:45
-
Re: zk keeps disconnecting and reconnectingBenjamin Reed 2011-08-30, 23:45
i have been wondering about 3.3.4. there are so many great bugs that
were fixed in 3.4.0 that it isn't clear what we should put into 3.3.4 or if we should even do it. the chroot bug does seem like a good one to do a 3.3.4 release for. ben On Mon, Aug 29, 2011 at 12:45 PM, Mahadev Konar <[EMAIL PROTECTED]> wrote: > Camille, > I will be cutting a branch this week some time. Just waiting for ZOOKEEPER-999 to get in. Other than that, we are probably 2 weeks away from the release. > 3.3.4 would be good even if we have 3.4 coming in a week or 2. Thats because 3.4.0 might take sometime to stabilize and 3.3.4 would be a good stable release (recommended for production use), until 3.4 stabilizes. > Does that sound reasonable? Others? > > thanks > mahadev > > On Aug 29, 2011, at 12:38 PM, Fournier, Camille F. wrote: > >> Yeah let's put it in 3.3.4. What's the plan for 3.4? I thought we were almost ready for that. >> >> C >> >> -----Original Message----- >> From: Mahadev Konar [mailto:[EMAIL PROTECTED]] >> Sent: Monday, August 29, 2011 2:10 PM >> To: [EMAIL PROTECTED] >> Subject: Re: zk keeps disconnecting and reconnecting >> >> Camille, >> Do you think we should put the fix in 3.3.4? I think 3.4 might take a while to stabilize, so 3.3.4 would be a good release to get this in. >> >> Thoughts? >> >> mahadev >> >> On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote: >> >>> Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception. >>> >>> The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. >>> >>> >>> C >>> >>> -----Original Message----- >>> From: Jun Rao [mailto:[EMAIL PROTECTED]] >>> Sent: Monday, August 29, 2011 12:39 PM >>> To: [EMAIL PROTECTED]; [EMAIL PROTECTED] >>> Subject: Re: zk keeps disconnecting and reconnecting >>> >>> What's the impact of ZOOKEEPER-961? If it shows up, does that mean the >>> client won't get any watcher events afterwards? If so, this sounds like a >>> blocker for 3.4 release to me. What's the temporary solution for 3.3.3? >>> >>> Also, for the very first time that the ZK client gets disconnected, I saw >>> the following entry in the log. It seems that the client can't ping the >>> server for 4 seconds. The ZK server was up at that time and the load was >>> minimal. What could cause the time out? Client GC pauses? >>> >>> 2011/08/26 10:58:22.306 INFO [ClientCnxn] >>> [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, >>> have not heard from server in 4001ms for sessionid 0x131f >>> ddd84bc0006, closing socket connection and attempting reconnect >>> >>> Thanks, >>> >>> Jun >>> >>> On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: >>> >>>> Fournier, Camille F.: >>>>> Did anyone ever check resetting watches at client reconnect on a client >>>>> with a chroot? Looking at the code, we store the watches associated with >>>>> the non-chroot path, but they are set by the original request prepending >>>>> chroot to the request. However, it looks like the SetWatches request on >>>>> reconnect just calls get on the various watch lists from ZooKeeper, which >>>>> don't have the prepended chroot. >>>>> >>>>> I haven't written a test but I would bet dollars to donuts this is the >>>>> problem. >>>>> >>>>> C >>>> seems to be this: >>>> ZOOKEEPER-961, ZOOKEEPER-1091 >>>> >>>> Regards, >>>> >> +
Benjamin Reed 2011-08-30, 23:45
-
Re: zk keeps disconnecting and reconnectingCamille Fournier 2011-08-30, 23:50
If we think 3.4 is going to be stable I am cool with just doing a 3.4
release. If we are concerned with the stability of 3.4 though, we need to go through the bugfixes that were put in and figure out what should be in 3.3.4. A lot of them got put into both I think (1046 for example). C On Tue, Aug 30, 2011 at 7:45 PM, Benjamin Reed <[EMAIL PROTECTED]> wrote: > i have been wondering about 3.3.4. there are so many great bugs that > were fixed in 3.4.0 that it isn't clear what we should put into 3.3.4 > or if we should even do it. the chroot bug does seem like a good one > to do a 3.3.4 release for. > > ben > > On Mon, Aug 29, 2011 at 12:45 PM, Mahadev Konar <[EMAIL PROTECTED]> wrote: >> Camille, >> I will be cutting a branch this week some time. Just waiting for ZOOKEEPER-999 to get in. Other than that, we are probably 2 weeks away from the release. >> 3.3.4 would be good even if we have 3.4 coming in a week or 2. Thats because 3.4.0 might take sometime to stabilize and 3.3.4 would be a good stable release (recommended for production use), until 3.4 stabilizes. >> Does that sound reasonable? Others? >> >> thanks >> mahadev >> >> On Aug 29, 2011, at 12:38 PM, Fournier, Camille F. wrote: >> >>> Yeah let's put it in 3.3.4. What's the plan for 3.4? I thought we were almost ready for that. >>> >>> C >>> >>> -----Original Message----- >>> From: Mahadev Konar [mailto:[EMAIL PROTECTED]] >>> Sent: Monday, August 29, 2011 2:10 PM >>> To: [EMAIL PROTECTED] >>> Subject: Re: zk keeps disconnecting and reconnecting >>> >>> Camille, >>> Do you think we should put the fix in 3.3.4? I think 3.4 might take a while to stabilize, so 3.3.4 would be a good release to get this in. >>> >>> Thoughts? >>> >>> mahadev >>> >>> On Aug 29, 2011, at 10:50 AM, Fournier, Camille F. wrote: >>> >>>> Well, it causes the problem you are seeing. If you set any watchers with a chroot and then your client gets disconnected with these watches outstanding, when you reconnect you will try to reset them and they are probably on paths that don't exist (if you are creating everything under path /kafka-tracking). So you get a notification about the watches immediately after resetting them, which causes the string out of bounds exception. >>>> >>>> The only fix is to disable auto watch reset, and then have your own client reset watches when it gets a reconnected event. I suspect it would be easier for you to take a shot at fixing the bug than to rewrite your client to handle this. Thomas provided a patch with tests that presumably show the error, so all you need is a fix to make them pass. >>>> >>>> >>>> C >>>> >>>> -----Original Message----- >>>> From: Jun Rao [mailto:[EMAIL PROTECTED]] >>>> Sent: Monday, August 29, 2011 12:39 PM >>>> To: [EMAIL PROTECTED]; [EMAIL PROTECTED] >>>> Subject: Re: zk keeps disconnecting and reconnecting >>>> >>>> What's the impact of ZOOKEEPER-961? If it shows up, does that mean the >>>> client won't get any watcher events afterwards? If so, this sounds like a >>>> blocker for 3.4 release to me. What's the temporary solution for 3.3.3? >>>> >>>> Also, for the very first time that the ZK client gets disconnected, I saw >>>> the following entry in the log. It seems that the client can't ping the >>>> server for 4 seconds. The ZK server was up at that time and the load was >>>> minimal. What could cause the time out? Client GC pauses? >>>> >>>> 2011/08/26 10:58:22.306 INFO [ClientCnxn] >>>> [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, >>>> have not heard from server in 4001ms for sessionid 0x131f >>>> ddd84bc0006, closing socket connection and attempting reconnect >>>> >>>> Thanks, >>>> >>>> Jun >>>> >>>> On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch <[EMAIL PROTECTED]> wrote: >>>> >>>>> Fournier, Camille F.: >>>>>> Did anyone ever check resetting watches at client reconnect on a client >>>>>> with a chroot? Looking at the code, we store the watches associated with >>>>>> the non-chroot path, but they are set by the original request prepending +
Camille Fournier 2011-08-30, 23:50
-
Re: zk keeps disconnecting and reconnectingJun Rao 2011-08-31, 01:48
I was also wondering why our clients get disconnected in the first place
since the ZK servers are all up. The following are the logs when the first disconnect happens. Does anyone know why the client can't seem to connect to most servers? Also, is "Session 0x1320765b0ac002e for server nulll" normal? Thanks, 2011/08/29 07:33:51.824 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect2011/08/29 07:33:51.824 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002e, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:51.990 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected)2011/08/29 07:33:52.019 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 2011/08/29 07:33:52.092 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/08/29 07:33:52.093 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/08/29 07:33:52.094 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:52.652 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/08/29 07:33:52.652 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session2011/08/29 07:33:53.075 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002e, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:53.108 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/08/29 07:33:53.108 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/08/29 07:33:53.109 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:53.577 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/08/29 07:33:53.578 WARN [ClientCnxn] [main-SendThread(esv4-app30.stg:12913)] [kafka] Session 0x1320765b0ac002e for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) 2011/08/29 07:33:53.683 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/08/29 07:33:53.683 WARN [ClientCnxn] [main-SendThread(esv4-app30.stg:12913)] [kafka] Session 0x1320765b0ac002f for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) Jun On Mon, Aug 29, 2011 at 10:50 AM, Fournier, Camille F. < [EMAIL PROTECTED]> wrote: +
Jun Rao 2011-08-31, 01:48
-
Re: zk keeps disconnecting and reconnectingkishore g 2011-08-31, 17:02
Here is a simple test case that reproduces this error.
public void testChroot() throws Exception { Watcher watcher = new Watcher() { @Override public void process(WatchedEvent event) { System.out.println("Event:" + event); } }; ZooKeeper zk = new ZooKeeper("localhost:2181/foo", 6000, watcher); //uncommenting this line will not cause infinite connect/disconnect //zk.create("/", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); zk.exists("/", true); System.out.println("Stop the server and restart it when you see this message"); Thread.currentThread().join(); } As pointed out earlier in this thread, setting an watch on a non-existent path triggers this. Is some one working on a patch for 961 and the issues described in this thread. Any pointers on what needs to be fixed for both the issues? I can take a look and submit a patch if I can fix it. thanks, Kishore G On Tue, Aug 30, 2011 at 6:48 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > I was also wondering why our clients get disconnected in the first place > since the ZK servers are all up. The following are the logs when the first > disconnect happens. Does anyone know why the client can't seem to connect > to > most servers? Also, is "Session 0x1320765b0ac002e for server nulll" > normal? > Thanks, > > 2011/08/29 07:33:51.824 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002f, likely server h > as closed socket, closing socket connection and attempting > reconnect2011/08/29 07:33:51.824 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002e, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:51.990 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (Disconnected)2011/08/29 07:33:52.019 INFO [ZkClient] > [main-EventThread] [kafka] zookeeper state changed (Disconnected) > 2011/08/29 07:33:52.092 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > to > server esv4-app29.stg/172.18.98.89:12913 > 2011/08/29 07:33:52.093 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > established to esv4-app29.stg/172.18.98.89:12913, initiating session > 2011/08/29 07:33:52.094 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002f, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:52.652 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/29 07:33:52.652 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating > session2011/08/29 07:33:53.075 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002e, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:53.108 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection > to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/29 07:33:53.108 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating session > 2011/08/29 07:33:53.109 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002f, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:53.577 INFO [ClientCnxn] +
kishore g 2011-08-31, 17:02
-
RE: zk keeps disconnecting and reconnectingFournier, Camille F. 2011-09-01, 14:06
There's a test also attached to 961.
It would be nice if one of you that is having trouble with the bug would submit a fix. Is anyone working on it? Thanks, C -----Original Message----- From: kishore g [mailto:[EMAIL PROTECTED]] Sent: Wednesday, August 31, 2011 1:02 PM To: [EMAIL PROTECTED] Subject: Re: zk keeps disconnecting and reconnecting Here is a simple test case that reproduces this error. public void testChroot() throws Exception { Watcher watcher = new Watcher() { @Override public void process(WatchedEvent event) { System.out.println("Event:" + event); } }; ZooKeeper zk = new ZooKeeper("localhost:2181/foo", 6000, watcher); //uncommenting this line will not cause infinite connect/disconnect //zk.create("/", new byte[0], Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT); zk.exists("/", true); System.out.println("Stop the server and restart it when you see this message"); Thread.currentThread().join(); } As pointed out earlier in this thread, setting an watch on a non-existent path triggers this. Is some one working on a patch for 961 and the issues described in this thread. Any pointers on what needs to be fixed for both the issues? I can take a look and submit a patch if I can fix it. thanks, Kishore G On Tue, Aug 30, 2011 at 6:48 PM, Jun Rao <[EMAIL PROTECTED]> wrote: > I was also wondering why our clients get disconnected in the first place > since the ZK servers are all up. The following are the logs when the first > disconnect happens. Does anyone know why the client can't seem to connect > to > most servers? Also, is "Session 0x1320765b0ac002e for server nulll" > normal? > Thanks, > > 2011/08/29 07:33:51.824 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002f, likely server h > as closed socket, closing socket connection and attempting > reconnect2011/08/29 07:33:51.824 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002e, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:51.990 INFO [ZkClient] [main-EventThread] [kafka] > zookeeper > state changed (Disconnected)2011/08/29 07:33:52.019 INFO [ZkClient] > [main-EventThread] [kafka] zookeeper state changed (Disconnected) > 2011/08/29 07:33:52.092 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > to > server esv4-app29.stg/172.18.98.89:12913 > 2011/08/29 07:33:52.093 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection > established to esv4-app29.stg/172.18.98.89:12913, initiating session > 2011/08/29 07:33:52.094 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002f, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:52.652 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection > to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/29 07:33:52.652 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection > established to esv4-app28.stg/172.18.98.101:12913, initiating > session2011/08/29 07:33:53.075 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional > data from server sessionid 0x1320765b0ac002e, likely server h > as closed socket, closing socket connection and attempting reconnect > 2011/08/29 07:33:53.108 INFO [ClientCnxn] > [main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection > to > server esv4-app28.stg/172.18.98.101:12913 > 2011/08/29 07:33:53.108 INFO [ClientCnxn] > [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection +
Fournier, Camille F. 2011-09-01, 14:06
|