|
Neha Narkhede
2011-11-10, 19:15
Patrick Hunt
2011-11-10, 19:50
Patrick Hunt
2011-11-10, 19:54
Patrick Hunt
2011-11-10, 20:20
Patrick Hunt
2011-11-10, 21:02
Camille Fournier
2011-11-10, 21:34
Patrick Hunt
2011-11-10, 21:46
Neha Narkhede
2011-11-10, 21:52
Neha Narkhede
2011-11-10, 22:01
Patrick Hunt
2011-11-10, 22:08
Neha Narkhede
2011-11-10, 23:31
Patrick Hunt
2011-11-11, 00:43
Patrick Hunt
2011-11-11, 01:47
Neha Narkhede
2011-11-11, 15:21
Patrick Hunt
2011-11-11, 17:47
|
-
ephemeral node not deleted after client session closedNeha Narkhede 2011-11-10, 19:15
Hi,
We are seeing this issue again with our Kafka zookeeper setup. Let me explain with the following log entries - Here is the node that didn't get deleted - get /kafka-tracking/brokers/ids/528 172.18.98.101-1320781574264:172.18.98.101:10251 cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011 mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011 pZxid = 0x605f90a78cversion = 0 dataVersion = 0 aclVersion = 0 ephemeralOwner = 0x23220b93b4a33bc dataLength = 47 numChildren = 0 I ran Kishore's awesome version of the LogFormatter tool on the relevant portion of the zookeeper transaction logs - time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 zxid:0x605f90a59 type:createSession timeOut:6000 time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 zxid:0x605f90a72 type:closeSession time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone} We want to understand why there was a 12 second delay in creation of znode after createSession Checking the logs, we found that the there is a 30+ sec delay between the commitProcessor and FinalRequestProcessor. 2011-11-08 11:46:14,661 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a 2011-11-08 11:46:26,045 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a 2011-11-08 11:46:26,047 - DEBUG [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing request:: sessionid:0x23220b93b4a33bc type:create cxid:0 x1 zxid:0x605f90a78 txntype:1 reqpath:n/a 2011-11-08 11:46:59,539 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z xid:0x605f90a59 txntype:-10 reqpath:n/a 2011-11-08 11:46:59,587 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx id:0x605f90a72 txntype:-11 reqpath:n/a 2011-11-08 11:46:59,589 - DEBUG [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6 05f90a78 txntype:1 reqpath:n/a Here is the stat output from our zookeeper servers - Latency min/avg/max: 0/53/44712 Latency min/avg/max: 0/11/11727 Latency min/avg/max: 0/12/11994 Latency min/avg/max: 0/9/11707 The relevant bug is here - https://issues.apache.org/jira/browse/ZOOKEEPER-1208 This is a pretty serious issue since the fix is to have someone manually delete the ephemeral node. And until that is done, it blocks the startup of either the consumer client or the Kafka servers. Any help here in debugging this issue is appreciated ! Thanks, Neha
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 19:50
Great report, some addl questions:
1) Why is the session closed, the client closed it or the cluster expired it? 2) which server was the session attached to - the first (44sec max lat) or one of the others? Which server was the leader? 3) the znode exists on all 4 servers, is that right? 4) it's clear that a znode is being created for a session that's just been closed. This is certainly a bug. Perhaps a timing issue due to the huge latency hit that you reported. 5) why are your max latencies, as well as avg latency, so high? a) are these dedicated boxes, not virtualized, correct? b) is the jvm going into gc pause? (try turning on verbose logging, or use "jstat" with the gc options to see the history if you still have those jvms running) c) swapping? d) do you have dedicated spindles for the ZK WAL? If not another process might be causing the fsyncs to pause. (you can use iostat or strace to monitor this) My guess is b given it's the time btw the commit processor and finalreq processor that's so long. Is that the log from the server that's got the 44sec max latency? Patrick On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote: > Hi, > > We are seeing this issue again with our Kafka zookeeper setup. Let me > explain with the following log entries - > > Here is the node that didn't get deleted - > > get /kafka-tracking/brokers/ids/528 > 172.18.98.101-1320781574264:172.18.98.101:10251 > cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011 > mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011 > pZxid = 0x605f90a78cversion = 0 > dataVersion = 0 > aclVersion = 0 > ephemeralOwner = 0x23220b93b4a33bc > dataLength = 47 > numChildren = 0 > > I ran Kishore's awesome version of the LogFormatter tool on the > relevant portion of the zookeeper transaction logs - > > time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 > zxid:0x605f90a59 type:createSession timeOut:6000 > time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 > zxid:0x605f90a72 type:closeSession > time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 > zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone} > > We want to understand why there was a 12 second delay in creation of > znode after createSession > > Checking the logs, we found that the there is a 30+ sec delay between > the commitProcessor and FinalRequestProcessor. > > 2011-11-08 11:46:14,661 - DEBUG > [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing > request:: sessionid:0x23220b93b4a33bc type:createSession > cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a > 2011-11-08 11:46:26,045 - DEBUG > [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing > request:: sessionid:0x23220b93b4a33bc type:closeSession > cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a > 2011-11-08 11:46:26,047 - DEBUG > [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing > request:: sessionid:0x23220b93b4a33bc type:create cxid:0 > x1 zxid:0x605f90a78 txntype:1 reqpath:n/a > 2011-11-08 11:46:59,539 - DEBUG > [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: > sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z > xid:0x605f90a59 txntype:-10 reqpath:n/a > 2011-11-08 11:46:59,587 - DEBUG > [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: > sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx > id:0x605f90a72 txntype:-11 reqpath:n/a > 2011-11-08 11:46:59,589 - DEBUG > [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: > sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6 > 05f90a78 txntype:1 reqpath:n/a > > Here is the stat output from our zookeeper servers - > > Latency min/avg/max: 0/53/44712 > Latency min/avg/max: 0/11/11727 > Latency min/avg/max: 0/12/11994 > Latency min/avg/max: 0/9/11707 > > The relevant bug is here - https://issues.apache.org/jira/browse/ZOOKEEPER-1208 > > This is a pretty serious issue since the fix is to have someone > manually delete the ephemeral node. And until that is done, it blocks
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 19:54
ps. was there any instability in the quorum itself during this time
period? Attaching logs to 1208 would be helpful. Patrick On Thu, Nov 10, 2011 at 11:50 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > Great report, some addl questions: > > 1) Why is the session closed, the client closed it or the cluster expired it? > > 2) which server was the session attached to - the first (44sec max > lat) or one of the others? Which server was the leader? > > 3) the znode exists on all 4 servers, is that right? > > 4) it's clear that a znode is being created for a session that's just > been closed. This is certainly a bug. Perhaps a timing issue due to > the huge latency hit that you reported. > > 5) why are your max latencies, as well as avg latency, so high? > a) are these dedicated boxes, not virtualized, correct? > b) is the jvm going into gc pause? (try turning on verbose logging, or > use "jstat" with the gc options to see the history if you still have > those jvms running) > c) swapping? > d) do you have dedicated spindles for the ZK WAL? If not another > process might be causing the fsyncs to pause. (you can use iostat or > strace to monitor this) > > My guess is b given it's the time btw the commit processor and > finalreq processor that's so long. Is that the log from the server > that's got the 44sec max latency? > > Patrick > > On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote: >> Hi, >> >> We are seeing this issue again with our Kafka zookeeper setup. Let me >> explain with the following log entries - >> >> Here is the node that didn't get deleted - >> >> get /kafka-tracking/brokers/ids/528 >> 172.18.98.101-1320781574264:172.18.98.101:10251 >> cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011 >> mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011 >> pZxid = 0x605f90a78cversion = 0 >> dataVersion = 0 >> aclVersion = 0 >> ephemeralOwner = 0x23220b93b4a33bc >> dataLength = 47 >> numChildren = 0 >> >> I ran Kishore's awesome version of the LogFormatter tool on the >> relevant portion of the zookeeper transaction logs - >> >> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 >> zxid:0x605f90a59 type:createSession timeOut:6000 >> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 >> zxid:0x605f90a72 type:closeSession >> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 >> zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone} >> >> We want to understand why there was a 12 second delay in creation of >> znode after createSession >> >> Checking the logs, we found that the there is a 30+ sec delay between >> the commitProcessor and FinalRequestProcessor. >> >> 2011-11-08 11:46:14,661 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:createSession >> cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a >> 2011-11-08 11:46:26,045 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:closeSession >> cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a >> 2011-11-08 11:46:26,047 - DEBUG >> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >> request:: sessionid:0x23220b93b4a33bc type:create cxid:0 >> x1 zxid:0x605f90a78 txntype:1 reqpath:n/a >> 2011-11-08 11:46:59,539 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:createSession cxid:0x0 z >> xid:0x605f90a59 txntype:-10 reqpath:n/a >> 2011-11-08 11:46:59,587 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:closeSession cxid:0x0 zx >> id:0x605f90a72 txntype:-11 reqpath:n/a >> 2011-11-08 11:46:59,589 - DEBUG >> [CommitProcessor:1:FinalRequestProcessor@78] - Processing request:: >> sessionid:0x23220b93b4a33bc type:create cxid:0x1 zxid:0x6 >> 05f90a78 txntype:1 reqpath:n/a >> >> Here is the stat output from our zookeeper servers -
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 20:20
I believe I see the issue.
PrepRequestProcessor checks the session is valid on a create. However it doesn't not know about an in-progress close session request The session is not invalidated (session tracker) until FinalRequestProcessor. At which point a create in PrepRP would fail. However prior to that it will get into the pipeline. Resulting in the issue you are seeing. Q: what are you clients doing? It's weird that a create would come from the client after the session has been closed. Do you have multiple threads sharing a session? Patrick On Thu, Nov 10, 2011 at 11:54 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > ps. was there any instability in the quorum itself during this time > period? Attaching logs to 1208 would be helpful. > > Patrick > > On Thu, Nov 10, 2011 at 11:50 AM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> Great report, some addl questions: >> >> 1) Why is the session closed, the client closed it or the cluster expired it? >> >> 2) which server was the session attached to - the first (44sec max >> lat) or one of the others? Which server was the leader? >> >> 3) the znode exists on all 4 servers, is that right? >> >> 4) it's clear that a znode is being created for a session that's just >> been closed. This is certainly a bug. Perhaps a timing issue due to >> the huge latency hit that you reported. >> >> 5) why are your max latencies, as well as avg latency, so high? >> a) are these dedicated boxes, not virtualized, correct? >> b) is the jvm going into gc pause? (try turning on verbose logging, or >> use "jstat" with the gc options to see the history if you still have >> those jvms running) >> c) swapping? >> d) do you have dedicated spindles for the ZK WAL? If not another >> process might be causing the fsyncs to pause. (you can use iostat or >> strace to monitor this) >> >> My guess is b given it's the time btw the commit processor and >> finalreq processor that's so long. Is that the log from the server >> that's got the 44sec max latency? >> >> Patrick >> >> On Thu, Nov 10, 2011 at 11:15 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote: >>> Hi, >>> >>> We are seeing this issue again with our Kafka zookeeper setup. Let me >>> explain with the following log entries - >>> >>> Here is the node that didn't get deleted - >>> >>> get /kafka-tracking/brokers/ids/528 >>> 172.18.98.101-1320781574264:172.18.98.101:10251 >>> cZxid = 0x605f90a78ctime = Tue Nov 08 11:46:26 PST 2011 >>> mZxid = 0x605f90a78mtime = Tue Nov 08 11:46:26 PST 2011 >>> pZxid = 0x605f90a78cversion = 0 >>> dataVersion = 0 >>> aclVersion = 0 >>> ephemeralOwner = 0x23220b93b4a33bc >>> dataLength = 47 >>> numChildren = 0 >>> >>> I ran Kishore's awesome version of the LogFormatter tool on the >>> relevant portion of the zookeeper transaction logs - >>> >>> time:11/8/11 11:46:14 AM PST session:0x23220b93b4a33bc cxid:0x0 >>> zxid:0x605f90a59 type:createSession timeOut:6000 >>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x0 >>> zxid:0x605f90a72 type:closeSession >>> time:11/8/11 11:46:26 AM PST session:0x23220b93b4a33bc cxid:0x1 >>> zxid:0x605f90a78 type:create acl:[31,s{'world,'anyone} >>> >>> We want to understand why there was a 12 second delay in creation of >>> znode after createSession >>> >>> Checking the logs, we found that the there is a 30+ sec delay between >>> the commitProcessor and FinalRequestProcessor. >>> >>> 2011-11-08 11:46:14,661 - DEBUG >>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >>> request:: sessionid:0x23220b93b4a33bc type:createSession >>> cxid:0x0 zxid:0x605f90a59 txntype:-10 reqpath:n/a >>> 2011-11-08 11:46:26,045 - DEBUG >>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >>> request:: sessionid:0x23220b93b4a33bc type:closeSession >>> cxid:0x0 zxid:0x605f90a72 txntype:-11 reqpath:n/a >>> 2011-11-08 11:46:26,047 - DEBUG >>> [QuorumPeer:/0:0:0:0:0:0:0:0:12913:CommitProcessor@159] - Committing >>> request:: sessionid:0x23220b93b4a33bc type:create cxid:0
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 21:02
On Thu, Nov 10, 2011 at 12:20 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> Q: what are you clients doing? It's weird that a create would come > from the client after the session has been closed. Do you have > multiple threads sharing a session? The client (checked java) seems to protect against this. Was the session expired? IC. Based on the cxid of the close session being 0 I'm guessing it's an expiration. Patrick
-
Re: ephemeral node not deleted after client session closedCamille Fournier 2011-11-10, 21:34
This is zk 3.3.3?
>From my phone On Nov 10, 2011 4:02 PM, "Patrick Hunt" <[EMAIL PROTECTED]> wrote: > On Thu, Nov 10, 2011 at 12:20 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > > Q: what are you clients doing? It's weird that a create would come > > from the client after the session has been closed. Do you have > > multiple threads sharing a session? > > The client (checked java) seems to protect against this. Was the > session expired? IC. Based on the cxid of the close session being 0 > I'm guessing it's an expiration. > > Patrick >
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 21:46
According to 1208 this is 3.3.3.
On Thu, Nov 10, 2011 at 1:34 PM, Camille Fournier <[EMAIL PROTECTED]> wrote: > This is zk 3.3.3? > > From my phone > On Nov 10, 2011 4:02 PM, "Patrick Hunt" <[EMAIL PROTECTED]> wrote: > >> On Thu, Nov 10, 2011 at 12:20 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> > Q: what are you clients doing? It's weird that a create would come >> > from the client after the session has been closed. Do you have >> > multiple threads sharing a session? >> >> The client (checked java) seems to protect against this. Was the >> session expired? IC. Based on the cxid of the close session being 0 >> I'm guessing it's an expiration. >> >> Patrick >> >
-
Re: ephemeral node not deleted after client session closedNeha Narkhede 2011-11-10, 21:52
Thanks for the quick responses, guys! Please find my replies inline -
>> 1) Why is the session closed, the client closed it or the cluster expired it? Cluster expired it. >> 2) which server was the session attached to - the first (44sec max lat) or one of the others? Which server was the leader? We didn't run srvr, so no idea which server was the leader. I still have the zookeeper log4j and txn logs, as well as the client logs. Where do I look to find this out ? >> 3) the znode exists on all 4 servers, is that right? Yes >> 5) why are your max latencies, as well as avg latency, so high? >> a) are these dedicated boxes, not virtualized, correct? these are dedicated boxes, but zk is currently co-located with kafka, but on different disks >> b) is the jvm going into gc pause? (try turning on verbose logging, or use "jstat" with the gc options to see the history if you still have those jvms running) I don't believe we had gc logs on these machines. So its unclear. >> d) do you have dedicated spindles for the ZK WAL? If not another process might be causing the fsyncs to pause. (you can use iostat or strace to monitor this) No. The log4j and zk txn logs share the same disks. >> Is that the log from the server that's got the 44sec max latency? Yes. >> This is 3.3.3 ? Yes. >> was there any instability in the quorum itself during this time period? How do I find that out ? Thanks, Neha On Thu, Nov 10, 2011 at 1:34 PM, Camille Fournier <[EMAIL PROTECTED]>wrote: > This is zk 3.3.3? > > From my phone > On Nov 10, 2011 4:02 PM, "Patrick Hunt" <[EMAIL PROTECTED]> wrote: > >> On Thu, Nov 10, 2011 at 12:20 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> > Q: what are you clients doing? It's weird that a create would come >> > from the client after the session has been closed. Do you have >> > multiple threads sharing a session? >> >> The client (checked java) seems to protect against this. Was the >> session expired? IC. Based on the cxid of the close session being 0 >> I'm guessing it's an expiration. >> >> Patrick >> >
-
Re: ephemeral node not deleted after client session closedNeha Narkhede 2011-11-10, 22:01
To answer the remaining client-side questions -
>> Do you have multiple threads sharing a session? No >> Q: what are you clients doing? It's weird that a create would come from the client after the session has been closed. Here are the client side logs. It seems like the client session kept timing out. 2011/11/08 11:46:13.423 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 4000ms for sessionid 0x43220b966ee3930, closing socket connection and attempting reconnect 2011/11/08 11:46:13.523 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected)2011/11/08 11:46:13.562 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:129 132011/11/08 11:46:13.563 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session2011/11/08 11:46:14.173 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Unable to reconnect to ZooKeeper service, session 0x43220b966ee3930 has expired, closing socket connection2011/11/08 11:46:14.195 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Expired) 2011/11/08 11:46:14.195 INFO [ZooKeeper] [main-EventThread] [kafka] Initiating client connection, connectString=app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking sessionTimeout=6000 watcher=org.I0Itec.zkclient.ZkClient@679a9d19 2011/11/08 11:46:14.263 INFO [ClientCnxn] [main-SendThread()] [kafka] Opening socket connection to server app28.stg/172.18.98.101:129132011/11/0811:46:14.263 INFO [KafkaZooKeeper] [ZkClient-EventThread-495-esv4-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/k afka-tracking] [kafka] re-registering broker info in ZK for broker 5282011/11/08 11:46:14.263 INFO [ClientCnxn] [main-EventThread] [kafka] EventThread shut down 2011/11/08 11:46:14.263 INFO [KafkaZooKeeper] [ZkClient-EventThread-495-app27.stg:12913,esv4-app28.stg:12913,esv4-app29.stg:12913,esv4-app30.stg:12913/kafka-tracking] [kafka] Registering broker /brokers/ids/528 2011/11/08 11:46:14.264 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/11/08 11:46:15.764 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:16.448 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/11/08 11:46:16.448 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/11/08 11:46:17.948 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:18.168 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/11/08 11:46:18.169 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session 2011/11/08 11:46:19.669 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Client session timed out, have not heard from server in 1500ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:20.042 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Opening socket connection to server esv4-app27.stg/172.18.98.88:12913 2011/11/08 11:46:20.042 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Socket connection established to esv4-app27.stg/172.18.98.88:12913, initiating session 2011/11/08 11:46:21.543 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:21.841 INFO [ClientCnxn] [main-SendThread(app27.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/11/08 11:46:21.842 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/11/08 11:46:23.343 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:23.984 INFO [ClientCnxn] [main-SendThread(app28.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/11/08 11:46:23.985 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/11/08 11:46:25.486 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Client session timed out, have not heard from server in 1501ms for sessionid 0x0, closing socket connection and attempting reconnect 2011/11/08 11:46:25.637 INFO [ClientCnxn] [main-SendThread(app29.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/11/08 11:46:25.637 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Socket connection established to esv4-app30.stg/172.18.98.90:12913, initiating session 2011/11/08 11:46:26.065 INFO [ClientCnxn] [main-SendThread(app30.stg:12913)] [kafka] Session establishment complete on server esv4-app30.stg/172.18.98.90:12913, sessionid = 0x43220b966ee394a, negotiated timeout = 6000 2011/11/08 11:46:26.066 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (SyncConnected) I will attach our conversat
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-10, 22:08
On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote:
> Thanks for the quick responses, guys! Please find my replies inline - > >>> 1) Why is the session closed, the client closed it or the cluster > expired it? > Cluster expired it. > Yes, I realized after that the cxid is 0 in your logs - that indicates it was expired and not closed explicitly by the client. >>> 3) the znode exists on all 4 servers, is that right? > Yes > This holds up my theory that the PrepRequestProcessor is accepting a create from the client after the session has been expired. >>> 5) why are your max latencies, as well as avg latency, so high? >>> a) are these dedicated boxes, not virtualized, correct? > these are dedicated boxes, but zk is currently co-located with kafka, but > on different disks > >>> b) is the jvm going into gc pause? (try turning on verbose logging, or > use "jstat" with the gc options to see the history if you still have > those jvms running) > I don't believe we had gc logs on these machines. So its unclear. > >>> d) do you have dedicated spindles for the ZK WAL? If not another > process might be causing the fsyncs to pause. (you can use iostat or > strace to monitor this) > No. The log4j and zk txn logs share the same disks. > >>> Is that the log from the server that's got the 44sec max latency? > Yes. > >>> This is 3.3.3 ? > Yes. > >>> was there any instability in the quorum itself during this time > period? > How do I find that out ? The logs would indicate if an election happens. Look for "LOOKING" or "LEADING" or "FOLLOWING". Your comments are consistent with my theory. Seems like a bug in PRP session validation to me. Patrick
-
Re: ephemeral node not deleted after client session closedNeha Narkhede 2011-11-10, 23:31
Thanks Patrick for looking into this issue !
>> The logs would indicate if an election happens. Look for "LOOKING" or "LEADING" or "FOLLOWING". The logs don't have any such entries. So I'm guessing there was no election happening. Do you have thoughts, though, on how easy it would be to reproduce this bug, to verify the bug fix ? Thanks, Neha On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED]> > wrote: > > Thanks for the quick responses, guys! Please find my replies inline - > > > >>> 1) Why is the session closed, the client closed it or the cluster > > expired it? > > Cluster expired it. > > > > Yes, I realized after that the cxid is 0 in your logs - that indicates > it was expired and not closed explicitly by the client. > > > >>> 3) the znode exists on all 4 servers, is that right? > > Yes > > > > This holds up my theory that the PrepRequestProcessor is accepting a > create from the client after the session has been expired. > > >>> 5) why are your max latencies, as well as avg latency, so high? > >>> a) are these dedicated boxes, not virtualized, correct? > > these are dedicated boxes, but zk is currently co-located with kafka, but > > on different disks > > > >>> b) is the jvm going into gc pause? (try turning on verbose logging, or > > use "jstat" with the gc options to see the history if you still have > > those jvms running) > > I don't believe we had gc logs on these machines. So its unclear. > > > >>> d) do you have dedicated spindles for the ZK WAL? If not another > > process might be causing the fsyncs to pause. (you can use iostat or > > strace to monitor this) > > No. The log4j and zk txn logs share the same disks. > > > >>> Is that the log from the server that's got the 44sec max latency? > > Yes. > > > >>> This is 3.3.3 ? > > Yes. > > > >>> was there any instability in the quorum itself during this time > > period? > > How do I find that out ? > > The logs would indicate if an election happens. Look for "LOOKING" or > "LEADING" or "FOLLOWING". > > > Your comments are consistent with my theory. Seems like a bug in PRP > session validation to me. > > Patrick >
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-11, 00:43
See my update to 1208 for a test that demonstrates this.
On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote: > Thanks Patrick for looking into this issue ! > >>> The logs would indicate if an election happens. Look for "LOOKING" or > "LEADING" or "FOLLOWING". > > The logs don't have any such entries. So I'm guessing there was no election > happening. > > Do you have thoughts, though, on how easy it would be to reproduce this > bug, to verify the bug fix ? > > Thanks, > Neha > > > On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > >> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED]> >> wrote: >> > Thanks for the quick responses, guys! Please find my replies inline - >> > >> >>> 1) Why is the session closed, the client closed it or the cluster >> > expired it? >> > Cluster expired it. >> > >> >> Yes, I realized after that the cxid is 0 in your logs - that indicates >> it was expired and not closed explicitly by the client. >> >> >> >>> 3) the znode exists on all 4 servers, is that right? >> > Yes >> > >> >> This holds up my theory that the PrepRequestProcessor is accepting a >> create from the client after the session has been expired. >> >> >>> 5) why are your max latencies, as well as avg latency, so high? >> >>> a) are these dedicated boxes, not virtualized, correct? >> > these are dedicated boxes, but zk is currently co-located with kafka, but >> > on different disks >> > >> >>> b) is the jvm going into gc pause? (try turning on verbose logging, or >> > use "jstat" with the gc options to see the history if you still have >> > those jvms running) >> > I don't believe we had gc logs on these machines. So its unclear. >> > >> >>> d) do you have dedicated spindles for the ZK WAL? If not another >> > process might be causing the fsyncs to pause. (you can use iostat or >> > strace to monitor this) >> > No. The log4j and zk txn logs share the same disks. >> > >> >>> Is that the log from the server that's got the 44sec max latency? >> > Yes. >> > >> >>> This is 3.3.3 ? >> > Yes. >> > >> >>> was there any instability in the quorum itself during this time >> > period? >> > How do I find that out ? >> >> The logs would indicate if an election happens. Look for "LOOKING" or >> "LEADING" or "FOLLOWING". >> >> >> Your comments are consistent with my theory. Seems like a bug in PRP >> session validation to me. >> >> Patrick >> >
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-11, 01:47
Ok, patch posted that fixes this (1208). Committers please take a look.
Neha you might want to give a patched version a try. Awesome job helping to document and track down this issue. Thanks! Patrick On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: > See my update to 1208 for a test that demonstrates this. > > On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote: >> Thanks Patrick for looking into this issue ! >> >>>> The logs would indicate if an election happens. Look for "LOOKING" or >> "LEADING" or "FOLLOWING". >> >> The logs don't have any such entries. So I'm guessing there was no election >> happening. >> >> Do you have thoughts, though, on how easy it would be to reproduce this >> bug, to verify the bug fix ? >> >> Thanks, >> Neha >> >> >> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> >>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED]> >>> wrote: >>> > Thanks for the quick responses, guys! Please find my replies inline - >>> > >>> >>> 1) Why is the session closed, the client closed it or the cluster >>> > expired it? >>> > Cluster expired it. >>> > >>> >>> Yes, I realized after that the cxid is 0 in your logs - that indicates >>> it was expired and not closed explicitly by the client. >>> >>> >>> >>> 3) the znode exists on all 4 servers, is that right? >>> > Yes >>> > >>> >>> This holds up my theory that the PrepRequestProcessor is accepting a >>> create from the client after the session has been expired. >>> >>> >>> 5) why are your max latencies, as well as avg latency, so high? >>> >>> a) are these dedicated boxes, not virtualized, correct? >>> > these are dedicated boxes, but zk is currently co-located with kafka, but >>> > on different disks >>> > >>> >>> b) is the jvm going into gc pause? (try turning on verbose logging, or >>> > use "jstat" with the gc options to see the history if you still have >>> > those jvms running) >>> > I don't believe we had gc logs on these machines. So its unclear. >>> > >>> >>> d) do you have dedicated spindles for the ZK WAL? If not another >>> > process might be causing the fsyncs to pause. (you can use iostat or >>> > strace to monitor this) >>> > No. The log4j and zk txn logs share the same disks. >>> > >>> >>> Is that the log from the server that's got the 44sec max latency? >>> > Yes. >>> > >>> >>> This is 3.3.3 ? >>> > Yes. >>> > >>> >>> was there any instability in the quorum itself during this time >>> > period? >>> > How do I find that out ? >>> >>> The logs would indicate if an election happens. Look for "LOOKING" or >>> "LEADING" or "FOLLOWING". >>> >>> >>> Your comments are consistent with my theory. Seems like a bug in PRP >>> session validation to me. >>> >>> Patrick >>> >> >
-
Re: ephemeral node not deleted after client session closedNeha Narkhede 2011-11-11, 15:21
Pat,
That is excellent turnaround ! I will take a look at the running the test as well as your patch. Will be a good opportunity for me to start understanding the zookeeper codebase. Thanks again, Neha On Thursday, November 10, 2011, Patrick Hunt <[EMAIL PROTECTED]> wrote: > Ok, patch posted that fixes this (1208). Committers please take a look. > > Neha you might want to give a patched version a try. Awesome job > helping to document and track down this issue. Thanks! > > Patrick > > On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> See my update to 1208 for a test that demonstrates this. >> >> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[EMAIL PROTECTED]> wrote: >>> Thanks Patrick for looking into this issue ! >>> >>>>> The logs would indicate if an election happens. Look for "LOOKING" or >>> "LEADING" or "FOLLOWING". >>> >>> The logs don't have any such entries. So I'm guessing there was no election >>> happening. >>> >>> Do you have thoughts, though, on how easy it would be to reproduce this >>> bug, to verify the bug fix ? >>> >>> Thanks, >>> Neha >>> >>> >>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >>> >>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED] > >>>> wrote: >>>> > Thanks for the quick responses, guys! Please find my replies inline - >>>> > >>>> >>> 1) Why is the session closed, the client closed it or the cluster >>>> > expired it? >>>> > Cluster expired it. >>>> > >>>> >>>> Yes, I realized after that the cxid is 0 in your logs - that indicates >>>> it was expired and not closed explicitly by the client. >>>> >>>> >>>> >>> 3) the znode exists on all 4 servers, is that right? >>>> > Yes >>>> > >>>> >>>> This holds up my theory that the PrepRequestProcessor is accepting a >>>> create from the client after the session has been expired. >>>> >>>> >>> 5) why are your max latencies, as well as avg latency, so high? >>>> >>> a) are these dedicated boxes, not virtualized, correct? >>>> > these are dedicated boxes, but zk is currently co-located with kafka, but >>>> > on different disks >>>> > >>>> >>> b) is the jvm going into gc pause? (try turning on verbose logging, or >>>> > use "jstat" with the gc options to see the history if you still have >>>> > those jvms running) >>>> > I don't believe we had gc logs on these machines. So its unclear. >>>> > >>>> >>> d) do you have dedicated spindles for the ZK WAL? If not another >>>> > process might be causing the fsyncs to pause. (you can use iostat or >>>> > strace to monitor this) >>>> > No. The log4j and zk txn logs share the same disks. >>>> > >>>> >>> Is that the log from the server that's got the 44sec max latency? >>>> > Yes. >>>> > >>>> >>> This is 3.3.3 ? >>>> > Yes. >>>> > >>>> >>> was there any instability in the quorum itself during this time >>>> > period? >>>> > How do I find that out ? >>>> >>>> The logs would indicate if an election happens. Look for "LOOKING" or >>>> "LEADING" or "FOLLOWING". >>>> >>>> >>>> Your comments are consistent with my theory. Seems like a bug in PRP >>>> session validation to me. >>>> >>>> Patrick >>>> >>> >> >
-
Re: ephemeral node not deleted after client session closedPatrick Hunt 2011-11-11, 17:47
We always triage new issues as they come in (same when 1208 originally
came in). However our ability to determine the cause is often bounded by the information provided by the user, which in this recent update was excellent and pointed out exactly the flaw. Kudos. We'll button up this issue (1208) for 3.3/3.4/trunk. After which I'll work on cutting a new 3.3.4 that includes this and some other fixes. It would be good if you could test this patch in the mean time. Also, this is particularly worrisome for me, you reported the following for your servers: Latency min/avg/max: 0/53/44712 Latency min/avg/max: 0/11/11727 Latency min/avg/max: 0/12/11994 Latency min/avg/max: 0/9/11707 That's really really terrible performance and you won't be happy with ZK as a result. You need to followup with your ops team to determine why the performance you are seeing is so terrible. Both in terms of average and max latency. Spikes in max latency is usually due to GC, swap, or bad disk performance for the WAL. Bad average latency might indicate poor network performance, or again bad disk performance. Try turning on CMS/parallelGC. Also try using iostat and look at the await times you're seeing for the WAL disk (correlate that with spikes in max latency, those counters can be reset using a 4letterword). Regards, Patrick On Fri, Nov 11, 2011 at 7:21 AM, Neha Narkhede <[EMAIL PROTECTED]> wrote: > Pat, > > That is excellent turnaround ! I will take a look at the running the test > as well as your patch. Will be a good opportunity for me to start > understanding the zookeeper codebase. > > Thanks again, > Neha > > On Thursday, November 10, 2011, Patrick Hunt <[EMAIL PROTECTED]> wrote: >> Ok, patch posted that fixes this (1208). Committers please take a look. >> >> Neha you might want to give a patched version a try. Awesome job >> helping to document and track down this issue. Thanks! >> >> Patrick >> >> On Thu, Nov 10, 2011 at 4:43 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >>> See my update to 1208 for a test that demonstrates this. >>> >>> On Thu, Nov 10, 2011 at 3:31 PM, Neha Narkhede <[EMAIL PROTECTED]> > wrote: >>>> Thanks Patrick for looking into this issue ! >>>> >>>>>> The logs would indicate if an election happens. Look for "LOOKING" or >>>> "LEADING" or "FOLLOWING". >>>> >>>> The logs don't have any such entries. So I'm guessing there was no > election >>>> happening. >>>> >>>> Do you have thoughts, though, on how easy it would be to reproduce this >>>> bug, to verify the bug fix ? >>>> >>>> Thanks, >>>> Neha >>>> >>>> >>>> On Thu, Nov 10, 2011 at 2:08 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote: >>>> >>>>> On Thu, Nov 10, 2011 at 1:52 PM, Neha Narkhede <[EMAIL PROTECTED] >> >>>>> wrote: >>>>> > Thanks for the quick responses, guys! Please find my replies inline - >>>>> > >>>>> >>> 1) Why is the session closed, the client closed it or the cluster >>>>> > expired it? >>>>> > Cluster expired it. >>>>> > >>>>> >>>>> Yes, I realized after that the cxid is 0 in your logs - that indicates >>>>> it was expired and not closed explicitly by the client. >>>>> >>>>> >>>>> >>> 3) the znode exists on all 4 servers, is that right? >>>>> > Yes >>>>> > >>>>> >>>>> This holds up my theory that the PrepRequestProcessor is accepting a >>>>> create from the client after the session has been expired. >>>>> >>>>> >>> 5) why are your max latencies, as well as avg latency, so high? >>>>> >>> a) are these dedicated boxes, not virtualized, correct? >>>>> > these are dedicated boxes, but zk is currently co-located with > kafka, but >>>>> > on different disks >>>>> > >>>>> >>> b) is the jvm going into gc pause? (try turning on verbose > logging, or >>>>> > use "jstat" with the gc options to see the history if you still have >>>>> > those jvms running) >>>>> > I don't believe we had gc logs on these machines. So its unclear. >>>>> > >>>>> >>> d) do you have dedicated spindles for the ZK WAL? If not another >>>>> > process might be causing the fsyncs to pause. (you can use iostat or |