-
Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-18, 21:26
Hello,

We are using Zookeeper-3.3.4 with client session timeouts of 5 seconds, and we see frequent timeouts. We have a cluster of 50 nodes (3 of which are ZK nodes), and each node has 5 client connections (a total of 250 connections to the ensemble). While investigating the ZooKeeper connections, we found that sometimes pings sent from the ZooKeeper client do not return from the server within 5 seconds, and the client connection gets disconnected. Digging deeper, it seems that pings are enqueued the same way as other requests in the three-stage request processing pipeline (prep, sync, finalize) in zkserver. So if there are a lot of write operations from other active sessions in front of a ping from an inactive session in the queues, the inactive session could time out.

My question is whether the server can answer a ping request from the client immediately, as the purpose of the ping request seems to be to act as a heartbeat from relatively inactive sessions. If we keep a separate ping queue in the prep phase which forwards pings straight to the finalize phase, requests ahead of the ping that require I/O inside the sync phase would not cause client timeouts. I hope pings do not impose any ordering in the database. I did take a cursory look at the code and thought that this could be done. I would really appreciate an opinion regarding this.

As an aside, I should mention that increasing the session timeout to 20 seconds did improve the problem significantly. However, as we are using ZooKeeper to monitor the health of our components, increasing the timeout means that we only get to know of a component's death 20 seconds later. This is something we would definitely try to avoid, and we would like to go back to the 5 second timeout.

Regards,
Manosiz.
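The head-of-line blocking described in this message can be illustrated with a toy model. This is only a sketch and not ZooKeeper code: it assumes a single FIFO queue drained by one worker, and the service times (writes that each wait 0.6 s on a slow sync, a 2 ms ping) are invented for illustration.

```python
from collections import deque


def completion_times(requests):
    """Return {name: finish_time_s} for requests served strictly in FIFO order.

    `requests` is a list of (name, service_time_s) pairs. All requests are
    assumed to be queued at time 0 and handled by a single worker.
    """
    queue = deque(requests)
    clock = 0.0
    finished = {}
    while queue:
        name, service_time = queue.popleft()
        clock += service_time
        finished[name] = clock
    return finished


# Hypothetical workload: ten writes that each block 0.6 s on a slow disk sync,
# followed by one cheap ping from an otherwise idle session.
workload = [(f"write-{i}", 0.6) for i in range(10)] + [("ping", 0.002)]
times = completion_times(workload)
ping_done = times["ping"]
print(f"ping answered after {ping_done:.3f} s")
```

In this model the ping itself costs almost nothing, yet it is answered only after every write ahead of it has finished, which is longer than a 5 second session timeout.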
-
Re: Timeouts and ping handling
Patrick Hunt 2012-01-18, 22:34
5 seconds is fairly low. HBs are sent by the client every 1/3 of the timeout, with the expectation that it will get a response in another 1/3 of the timeout. If not, the client session will time out.

As a result, any blip of 1.5 sec or more between the client and server could cause this to happen: network latency, OS latency, ZK server latency, client latency, etc.

I suspect that you are being affected by GC pauses. Have you tuned the GC at all, or are you just using the defaults? Monitor the GC in the VM during operation and see if this is affecting you. At the very least you need to turn on parallel/CMS/incremental GC.

Patrick
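The arithmetic behind the one-third rule can be worked through as follows. This is a minimal sketch based only on the rule as stated in this message, not on the client source; the function name `heartbeat_budget` is made up for the example.

```python
def heartbeat_budget(session_timeout_s):
    """Split a session timeout into the thirds described in the message above."""
    ping_interval = session_timeout_s / 3.0  # idle time before a ping is sent
    reply_window = session_timeout_s / 3.0   # time allowed for the reply
    return ping_interval, reply_window


for timeout in (5.0, 20.0):
    interval, window = heartbeat_budget(timeout)
    print(f"timeout={timeout:.0f}s: ping every {interval:.2f}s, "
          f"reply expected within {window:.2f}s")
```

With a 5 second timeout the reply window is about 1.67 s, in line with the roughly 1.5 s figure quoted above. With 20 seconds it grows to about 6.67 s, which leaves far more room for a pause anywhere along the path.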
-
Re: Timeouts and ping handling
Patrick Hunt 2012-01-18, 22:41
Forgot to mention: use "stat" and some of the other four-letter words to get an idea of what your request latency looks like across servers. In particular you can see the "max latency" and correlate that with what you're seeing on the clients and with GC (etc.) activity.

Patrick
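One way to collect these numbers from several servers is a small script that sends "stat" to the client port and pulls out the latency and outstanding-request fields. This is a sketch under assumptions: the field labels (`Latency min/avg/max`, `Outstanding`) are as I recall them from the "stat" output and should be checked against your own servers, port 2181 is just the usual default, and the `SAMPLE` text is invented rather than captured from a real ensemble.

```python
import re
import socket


def fetch_stat(host, port=2181, timeout_s=2.0):
    """Send the 'stat' four-letter word and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(b"stat")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Extract latency (ms) and the outstanding-request count from 'stat' text."""
    result = {}
    latency = re.search(
        r"^Latency min/avg/max:\s*([\d.]+)/([\d.]+)/([\d.]+)", text, re.M)
    if latency:
        low, avg, high = (float(v) for v in latency.groups())
        result.update(min_ms=low, avg_ms=avg, max_ms=high)
    outstanding = re.search(r"^Outstanding:\s*(\d+)", text, re.M)
    if outstanding:
        result["outstanding"] = int(outstanding.group(1))
    return result


# Invented sample text for illustration only.
SAMPLE = """Latency min/avg/max: 0/3/2300
Received: 104512
Sent: 104630
Outstanding: 17
Mode: follower
"""
parsed = parse_stat(SAMPLE)
print(parsed)
```

Calling `parse_stat(fetch_stat(host))` for each server in the ensemble and logging the result with a timestamp makes it easier to line up a latency spike with the moment a client session timed out.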
-
Re: Timeouts and ping handling
Ted Dunning 2012-01-18, 22:47
Monitor GC on *both* ZK server and client. Either side can easily cause a 1-2 second delay if misconfigured.

On Wed, Jan 18, 2012 at 10:34 PM, Patrick Hunt <[EMAIL PROTECTED]> wrote:
> I suspect that you are being affected by GC pauses. Have you tuned the
> GC at all or just the defaults? Monitor the GC in the VM during
> operation and see if this is affecting you. At the very least you need
> to turn on parallel/CMS/incremental GC.
-
Re: Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-18, 22:47
Thanks a lot for your response. We are running the C client, as all our components are C++ applications. We are tracing GC on the server side, but did not see much activity there. We did tune GC. Our GC flags include the following:

JVMFLAGS="$JVMFLAGS -XX:+UseParNewGC"
JVMFLAGS="$JVMFLAGS -XX:+UseConcMarkSweepGC"
JVMFLAGS="$JVMFLAGS -XX:+CMSParallelRemarkEnabled"
JVMFLAGS="$JVMFLAGS -XX:SurvivorRatio=8"
JVMFLAGS="$JVMFLAGS -XX:MaxTenuringThreshold=1"
JVMFLAGS="$JVMFLAGS -XX:CMSInitiatingOccupancyFraction=75"
JVMFLAGS="$JVMFLAGS -XX:+UseCMSInitiatingOccupancyOnly"
JVMFLAGS="$JVMFLAGS -XX:ParallelCMSThreads=1"

The JMX console shows that the old gen is not getting full at all. The new gen is pretty much where the activity is, and the pauses reported by verbose:gc are only about 10-20 ms.
-
Re: Timeouts and ping handling
Patrick Hunt 2012-01-18, 22:53
Next up is disk. (I'm assuming you're not running in a virtualized environment, correct?) Do you have a dedicated log device for the transactional logs? Check your disk latency and make sure that's not holding up the writes.

What does "stat" show you wrt latency in general and at the time you see the issue on the client?

Have you looked through the troubleshooting guide?
http://wiki.apache.org/hadoop/ZooKeeper/Troubleshooting

Patrick
-
Re: Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-19, 00:47
Thanks Patrick for your answer,

Actually we are in a virtualized environment, and we have a FIO disk for the transactional logs. It does have some latency sometimes during FIO garbage collection. We know this could be the potential issue, but we were trying to work around that.

We were trying to classify the requests into two types: either HBs or normal requests. Isn't it better to reject normal requests once the queue has filled to, say, a certain threshold, but keep the session alive? That way flow control can be achieved with the user's session retrying the operation, but the session health would be maintained.

Regards,
Manosiz.
-
Re: Timeouts and ping handling
Ted Dunning 2012-01-19, 00:54
That really depends on whether you think that a stuck server is a problem. The primary indication of that is a full queue, and you are suggesting that we not detect this situation.

It isn't a matter of keeping the session alive ... it is a matter of whether or not we can guarantee that things are working. By all appearances, they aren't, and ZK is all about guarantees.

On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya <[EMAIL PROTECTED]> wrote:
> We were trying to qualify the requests into two types - either HB's or
> normal requests. Isn't it better to reject normal requests if the queue
> size is full to say a certain threshold, but keep the session alive. That
> way the flow control can be achieved with the users session retrying the
> operation, but the session health would be maintained.
-
Re: Timeouts and ping handling
Ted Dunning 2012-01-19, 01:15
Does FIO stand for Fusion I/O?

On Thu, Jan 19, 2012 at 12:47 AM, Manosiz Bhattacharyya <[EMAIL PROTECTED]> wrote:
> ... we have a FIO disk
-
Re: Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-19, 01:41
Yes.

On Wed, Jan 18, 2012 at 5:15 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Does FIO stand for Fusion I/O?
-
Re: Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-19, 01:47
I was not suggesting that we should not detect the situation of a stuck server. A watchdog of some sort keeping track of queue changes could also suffice. Thanks for your input. I guess we will try to work it out by increasing the timeout.

-- Manosiz.
-
Re: Timeouts and ping handling
Patrick Hunt 2012-01-19, 01:49
On Wed, Jan 18, 2012 at 4:47 PM, Manosiz Bhattacharyya <[EMAIL PROTECTED]> wrote:
> Thanks Patrick for your answer,

No problem.

> Actually we are in a virtualized environment, we have a FIO disk for
> transactional logs. It does have some latency sometimes during FIO garbage
> collection. We know this could be the potential issue, but was trying to
> workaround that.

Ah, I see. I saw something very similar to this recently with SSDs used for the datadir. The fdatasync latency was sometimes > 10 seconds. I suspect it happened as a result of disk GC activity.

I was able to identify the problem by running something like this:

sudo strace -r -T -f -p 8066 -e trace=fsync,fdatasync -o trace.txt

and then graphing the results (log scale). You should try running this against your servers to confirm that it is indeed the problem.

> We were trying to qualify the requests into two types - either HB's or
> normal requests. Isn't it better to reject normal requests if the queue
> size is full to say a certain threshold, but keep the session alive. That
> way the flow control can be achieved with the users session retrying the
> operation, but the session health would be maintained.

What good is a session (connection) that's not usable? You're better off disconnecting and re-establishing with a server that can process your requests in a timely fashion.

ZK looks at availability from a service perspective, not from an individual session/connection perspective. The whole is more important than the parts. There already is very sophisticated flow control going on, e.g. the sessions shut down and stop reading requests when the number of outstanding requests on a server exceeds some threshold. Once the server catches up it starts reading again. Again, check out your "stat" results for insight into this (i.e. "outstanding requests").

Patrick
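Before graphing, the sync durations can be pulled out of trace.txt with a short script. This is a sketch that assumes the usual strace layout, where `-T` appends the time spent in the call as `<seconds>` at the end of each line and `-f` may split a call into "unfinished" and "resumed" lines. The sample lines and the 1 second threshold are invented for illustration.

```python
import re

# Matches a completed fsync/fdatasync line (including "resumed" lines) and
# captures the syscall name plus the -T duration at the end of the line.
SYNC_LINE = re.compile(r"(fsync|fdatasync).*=\s*-?\d+.*<(\d+\.\d+)>\s*$")


def sync_durations(lines):
    """Return a list of (syscall, seconds) for every completed sync call."""
    found = []
    for line in lines:
        match = SYNC_LINE.search(line.rstrip("\n"))
        if match:
            found.append((match.group(1), float(match.group(2))))
    return found


# Invented sample in the shape of `strace -r -T -f -o trace.txt` output.
SAMPLE = """\
8071       0.000000 fdatasync(51)      = 0 <0.004213>
8071       0.250113 fdatasync(51)      = 0 <0.003977>
8071       1.102394 fdatasync(51 <unfinished ...>
8071      11.873001 <... fdatasync resumed> ) = 0 <11.872650>
8071       0.300412 fsync(52)          = 0 <0.006100>
""".splitlines()

durations = sync_durations(SAMPLE)
slow = [(name, secs) for name, secs in durations if secs > 1.0]
print(f"{len(durations)} sync calls, {len(slow)} slower than 1 s, "
      f"worst {max(secs for _, secs in durations):.3f} s")
```

For a real trace, pass `open("trace.txt")` instead of `SAMPLE`. Any sync call longer than a third of the session timeout is a candidate explanation for a dropped session.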
-
Re: Timeouts and ping handling
Manosiz Bhattacharyya 2012-01-19, 02:18
I will do as you mention.

We are using the async APIs throughout. Also, we do not write much data into ZooKeeper. We just use it for leadership elections and health monitoring, which is why we see the timeouts typically on idle ZooKeeper connections.

The reason we want the sessions to stay alive is the leadership election algorithm that we use from the ZooKeeper recipes. If the connection is broken for the leader node, the ephemeral node that guaranteed its leadership is lost, and reconnecting will create a new node which does not guarantee leadership. We then have to elect a new leader, which requires significant work. The bigger the timeout, the longer the cluster stays without a master for a particular service, as the old master cannot keep on working once it knows its session is gone and, with it, its ephemeral node. As we are trying to have a highly available service (not internet scale, but at the scale of a storage system with typically ms latencies), we thought about reducing the timeout but keeping the session open. Also note that the node that is typically the master does not write too often into ZooKeeper.

Thanks,
Manosiz.
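The effect described here, where the old leader loses its place once its ephemeral node is gone, can be shown with a toy in-memory model. This is only a sketch of the idea behind the election recipe (lowest sequence number leads); it does not use the ZooKeeper API, and the class `ToyElection` and its methods are hypothetical names invented for the example.

```python
import itertools


class ToyElection:
    """In-memory stand-in for an ephemeral-sequential election."""

    def __init__(self):
        self._next_seq = itertools.count()
        self._nodes = {}  # sequence number -> candidate name

    def join(self, candidate):
        """Create a new 'ephemeral sequential node' and return its sequence."""
        seq = next(self._next_seq)
        self._nodes[seq] = candidate
        return seq

    def session_expired(self, seq):
        """Drop the node, as happens when the owning session ends."""
        self._nodes.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number leads."""
        return self._nodes[min(self._nodes)] if self._nodes else None


election = ToyElection()
a = election.join("A")
b = election.join("B")
first_leader = election.leader()

election.session_expired(a)   # A's session times out
a2 = election.join("A")       # A reconnects and gets a fresh, higher number
second_leader = election.leader()
print(first_leader, "->", second_leader)
```

After the session loss, A rejoins behind B, so leadership moves even though A was only briefly unreachable. That is the cost a spurious timeout imposes on this kind of setup.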
-
Re: Timeouts and ping handling
Ted Dunning 2012-01-19, 06:18
If you aren't pushing much data through ZK, there is almost no way that the request queue can fill up without the log or snapshot disks being slow. See what happens if you put the log onto a real disk or (heaven help us) onto a tmpfs partition.
-
Re: Timeouts and ping handlingManosiz Bhattacharyya 2012-01-19, 17:31
I do not think that there is a problem with the queue size. I suspect the
problem is latency when the Fusion I/O device goes into garbage collection.
We are enabling stats on ZooKeeper and on the Fusion I/O device to be more
precise. Does ZooKeeper typically do only sequential I/O, or does it do
some random I/O too? We could then move the logs to a disk.

Thanks,
Manosiz.
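For the ZooKeeper-side stats mentioned in this message, one option is to poll the "stat" four-letter command that Patrick refers to earlier in the thread and watch the latency and outstanding-request figures over time. The sketch below is only an illustration: the helper names are made up, and the field labels ("Latency min/avg/max", "Outstanding", "Mode") are assumed from typical 3.3.x "stat" output, so compare them with what your servers actually print.

```python
import socket


def four_letter_word(host, port, word="stat", timeout=5.0):
    """Send a four-letter command to a ZooKeeper server and return the reply text."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_stat(text):
    """Pull latency (ms), outstanding requests and mode out of 'stat' output."""
    result = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "Latency min/avg/max":
            low, avg, high = (float(v) for v in value.split("/"))
            result["latency_ms"] = {"min": low, "avg": avg, "max": high}
        elif key == "Outstanding":
            result["outstanding"] = int(value)
        elif key == "Mode":
            result["mode"] = value
    return result
```

Calling `parse_stat(four_letter_word("localhost", 2181))` against each ensemble member every few seconds (host and port here are placeholders) gives a time series that can be lined up with the device-side statistics to see whether latency spikes coincide with device GC.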
-
Re: Timeouts and ping handlingPatrick Hunt 2012-01-19, 18:09
On Thu, Jan 19, 2012 at 9:31 AM, Manosiz Bhattacharyya
<[EMAIL PROTECTED]> wrote:
> I do not think that there is a problem with the queue size. I guess the
> problem is more with latency when the Fusion I/O goes in for a GC. We are
> enabling stats on the Zookeeper and the fusion I/O to be more precise. Does
> Zookeeper typically do only sequential I/O, or does it do some random too.
> We could then move the logs to a disk.

I was going to say what Ted said - it's odd to see such long pauses
given you don't have GC issues and you are barely using the system.
Your suspicion on disk may be correct.

The server really just does sequential IO - it's writing the WAL for
any changes and periodically taking the snapshot.

Note that this could be an issue in ZK itself. The c client talking to
the service using async operations with such low round trip
expectations is not something we typically see or in particular test.
It will be interesting to see the results of your further
investigations.

Btw, you are using c client - which version? the pthreads version or
the version where you manage timeslicing yourself?

Patrick
-
Re: Timeouts and ping handlingTed Dunning 2012-01-19, 18:18
ZK does pretty much entirely sequential I/O.
One thing that it does which might be very, very bad for SSD is that it pre-allocates disk extents in the log by writing a bunch of zeros. This is to avoid directory updates as the log is written, but it doubles the load on the SSD.
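To make the write-load point concrete, here is a toy model of a log that zero-fills a block ahead of the write position before appending records, as described in this message. It is not taken from ZooKeeper's source: the class name, the `prealloc_bytes` parameter and the `bytes_written` counter are all hypothetical, and the real server may extend its log file differently.

```python
import os


class PreallocatedLog:
    """Append-only log that zero-fills space in fixed-size blocks before use."""

    def __init__(self, path, prealloc_bytes=64 * 1024):
        self._file = open(path, "w+b")
        self._prealloc = prealloc_bytes
        self._allocated = 0      # bytes of the file already zero-filled
        self._pos = 0            # offset where the next record goes
        self.bytes_written = 0   # everything sent to the device, padding included

    def append(self, record):
        # Extend the zero-filled region until the record fits inside it.
        while self._pos + len(record) > self._allocated:
            self._file.seek(self._allocated)
            self._file.write(b"\0" * self._prealloc)
            self._allocated += self._prealloc
            self.bytes_written += self._prealloc
        # Overwrite the padding with the real record, then force it to disk.
        self._file.seek(self._pos)
        self._file.write(record)
        self._pos += len(record)
        self.bytes_written += len(record)
        self._file.flush()
        os.fsync(self._file.fileno())

    def close(self):
        self._file.close()
```

In this model every byte of real data lands on space that was already written once as zeros, so once a block is full `bytes_written` is twice the payload size. That is the doubling described above, under the assumption that the padding really is written out rather than left as a sparse region.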
-
Re: Timeouts and ping handlingManosiz Bhattacharyya 2012-01-19, 18:48
We are using the ZooKeeper C client version 3.3.4, the same version as the server.
We use libpthread-2.10.1.so, with no special time slicing in user code. Will let you know what we find.

Thanks,
Manosiz.
-
Re: Timeouts and ping handlingManosiz Bhattacharyya 2012-01-19, 18:49
Thanks a lot for this info. A pointer to where in the code this
preallocation is done, or a flag to disable it, would be very helpful.
-
Re: Timeouts and ping handlingPatrick Hunt 2012-01-19, 19:31
See "preAllocSize"
http://zookeeper.apache.org/doc/r3.4.2/zookeeperAdmin.html#sc_advancedConfiguration
-
Re: Timeouts and ping handlingManosiz Bhattacharyya 2012-01-19, 19:47
Thanks,
Manosiz.