-
Fast leader election initial delay, is that possible? Sampath Perera 2011-08-18, 03:40
Hi,

We have a deployment of a 3-node ZooKeeper quorum. When we start the 3 ZooKeeper nodes, the first node to start prints the following connection refused exception, which is expected, as nodes 2 and 3 are yet to be started. This seems to be because FastLeaderElection tries to connect to the other nodes specified in the quorum.

So my question is whether it is possible to configure an initial delay before FastLeaderElection is kicked off?

The rationale is that it is highly unlikely that all 3 nodes start at the same time, even when we try to command the startups at the same time. With a delay we could get rid of this stack trace from the logs; it triggers warnings in the tools that monitor the logs, yet it is not really a WARN but an expected error.

2011-08-18 08:53:15,530 [-] [WorkerSender Thread] WARN QuorumCnxManager Cannot open channel to 2 at election address localhost/127.0.0.1:3888
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:371)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:340)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:360)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:333)
    at java.lang.Thread.run(Thread.java:662)
2011-08-18 08:53:15,532 [-] [WorkerSender Thread] WARN QuorumCnxManager Cannot open channel to 3 at election address localhost/127.0.0.1:3889
java.net.ConnectException: Connection refused
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567)
    at sun.nio.ch.SocketAdaptor.connect(SocketAdaptor.java:100)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:371)
    at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:340)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:360)
    at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:333)
    at java.lang.Thread.run(Thread.java:662)

--
Thanks,
Sampath
http://adroitlogic.org
-
Re: Fast leader election initial delay, is that possible? Ted Dunning 2011-08-18, 05:19
Well, it is exactly the same situation as any other situation where connection to another server fails. There is no need to insert a delay here unless you think that there should be a delay before this particular warning should be considered a warning.

Would you like to suggest a patch?
-
Re: Fast leader election initial delay, is that possible? Sampath Perera 2011-08-18, 07:15
Hi Ted,

Thanks for the quick response.

On Thu, Aug 18, 2011 at 10:49 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> Well, it is exactly the same situation as any other situation where
> connection to another server fails.

Hhmmm, I think this is a bit different, isn't it? Here we know that the first server to come up will fail to connect to the others, as they are not yet up. Anyway, our real issue is the warning.

> There is no need to insert a delay here unless you think that there should
> be a delay before this particular warning should be considered a warning.

That would actually fix the problem that we are seeing.

> Would you like to suggest a patch?

Of course I do. I will prepare a patch and attach it.

--
Thanks,
Sampath
http://adroitlogic.org
-
Re: Fast leader election initial delay, is that possible? Flavio Junqueira 2011-08-18, 09:13
Hi Sampath,

When a server starts it tries to contact the others immediately; it backs off if it gets no response. It is true that servers are unlikely to start at exactly the same time, so you will get such warnings. However, I don't really see the point of setting such a configuration parameter. It is really difficult to estimate how much time is sufficient, so most likely you'll end up getting the warning anyway if you make an aggressive estimate, or you will wait more than necessary if you make a conservative estimate.

-Flavio
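As a rough illustration of the back-off Flavio mentions, the sketch below doubles a retry interval up to a cap, so connection attempts (and the warnings they produce) become less frequent over time. The class name, the method name, and the 200 ms / 60 s values are assumptions chosen for illustration; this is not taken from ZooKeeper's source.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of capped exponential back-off.
// Names and values are illustrative assumptions, not ZooKeeper's code.
final class BackoffSketch {

    /** Returns the wait (in ms) before each of the first {@code attempts} retries. */
    static List<Long> delaysMillis(long initialMillis, long maxMillis, int attempts) {
        List<Long> delays = new ArrayList<>();
        long delay = initialMillis;
        for (int i = 0; i < attempts; i++) {
            delays.add(delay);
            // Double the wait after every failed attempt, but never exceed the cap.
            delay = Math.min(delay * 2, maxMillis);
        }
        return delays;
    }

    public static void main(String[] args) {
        // Prints the schedule of waits for ten consecutive failed attempts.
        System.out.println(delaysMillis(200, 60_000, 10));
    }
}
```

With a schedule like this, a fixed start-up delay mostly shifts the first attempt; it does not change how quickly the warnings thin out afterwards.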
-
Re: Fast leader election initial delay, is that possible? Ted Dunning 2011-08-18, 15:36
On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <[EMAIL PROTECTED]> wrote:

> Hhmmm, I think this is a bit different isn't it? Here we know that the first
> server to come will be failing to connect to the other as they are not yet
> up. Anyway our real issue is the warning.

We know that.

But how does the server know that it is the first server? That is the whole point of the leader election. You might just have a server rejoining a cluster. Or you might have a cluster that has been turned off. Or a cluster with 2 out of 5 machines off and we tried to touch the other down machine before the others.

>> Would you like to suggest a patch?
>
> Of course I do.. will prepare a patch and attach.

Great!
-
Re: Fast leader election initial delay, is that possible? Ted Dunning 2011-08-18, 15:39
Flavio,

What you say is correct, but the original poster does have a point that many of these warnings are to be expected and there is a heuristic that might assist in distinguishing some of these cases so that false alarms in the logs could be decreased.

That doesn't seem like a big deal to me, but different people have different itches. In my experience, restarting a ZK cluster from zero almost never happens.
-
Re: Fast leader election initial delay, is that possible? Flavio Junqueira 2011-08-18, 15:54
Hi Ted,

I don't see how one can automate the distinction between a machine that is down because it crashed and a machine that is down because it hasn't started yet. Assuming that we are logging the machine unavailability as we are doing currently, one can always look at the timestamp of the warning and remember that this is the time the machines were bootstrapping. Consequently, I don't really see the point of reducing the number of warnings, unless the warnings are really polluting the logs. I typically don't see so many that they prevent me from reading the rest, but you may have a different perception. Also, recall that we back off, so the warnings become less frequent over time.

I'm open to ideas, though. If you see anything wrong in my rationale or if you have an idea of how to do it differently, then I'd be happy to hear it. However, if the idea is simply to add a parameter that configures the time for leader election to start, then I'm currently not in favor.

-Flavio
-
Re: Fast leader election initial delay, is that possible? Sampath Perera 2011-08-18, 16:54
Hi Flavio,

On Thu, Aug 18, 2011 at 9:24 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:

> Hi Ted, I don't see how one can automate the distinction between a machine
> that is down because it crashed and a machine that is down because it hasn't
> started yet. Assuming that we are logging the machine unavailability as we
> are doing currently, one can always look at the timestamp of the warning and
> remember that this is the time the machines were bootstrapping.
> Consequently, I don't really see the point of reducing the number of
> warnings, unless the warnings are really polluting the logs. I typically
> don't see so many that prevents me from reading the rest, but you may have a
> different perception. Also, recall that we back off, so the warnings become
> less frequent over time.

True, but one of customer deployments has a log-analyzing tool that sends notifications for errors in the log. As you said previously, we cannot get an optimal value for this timeout, but we can come up with a suboptimal value to get rid of this warning.

> I'm open to ideas, though. If you see anything wrong in my rationale or if
> you have an idea of how to do it differently, then I'd be happy to hear.
> However, if the idea is simply to add a parameter that configures the time
> for leader election to start, then I'm currently not in favor.

Well, what I was originally looking for was a way to delay the leader election, but as pointed out by Ted, I was going to provide a patch on printing this warning. (If you look carefully at Ted's comment and my response, I was thinking of a timeout before the warning is considered a warning to be printed in the log... at least that is what I got from Ted's first comment.) What do you think about that?

--
Thanks,
Sampath
http://adroitlogic.org
-
Re: Fast leader election initial delay, is that possible? Sampath Perera 2011-08-18, 16:55
s/one of customer/one of our customer

Sorry for the typo.

--
Thanks,
Sampath
http://adroitlogic.org
-
Re: Fast leader election initial delay, is that possible? Ted Dunning 2011-08-18, 17:13
The thought is that a server would not complain about connection refused or inability to form a quorum during the first (say) twenty seconds of operation.

The thesis is that warnings from these causes during that time are spurious.

As I mentioned, I don't see this as urgent or even necessarily a good idea. I completely reboot a ZK cluster once every year or three. When I am doing a rolling upgrade, I *want* to see alerts when I bounce a machine. If I don't want to see those alerts, my monitoring system allows me to put a machine into maintenance mode for a short period of time to temporarily suppress the warnings.

All I was doing was translating and elaborating the original poster's suggestion, not so much endorsing it.
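The grace-period idea described above could look roughly like the following. This is only a minimal sketch under stated assumptions: the class StartupGraceLog, its constructor arguments, and the twenty-second value are hypothetical, and no such option is known from this thread to exist in ZooKeeper.

```java
// Hypothetical sketch: hold back connection-failure warnings during an initial
// grace period after server start. Not an existing ZooKeeper class or setting.
final class StartupGraceLog {
    private final long startMillis;
    private final long graceMillis;

    /**
     * @param startMillis time the server started, in ms
     * @param graceMillis length of the quiet period; 0 keeps today's behaviour
     */
    StartupGraceLog(long startMillis, long graceMillis) {
        this.startMillis = startMillis;
        this.graceMillis = graceMillis;
    }

    /** True if a connection failure seen at {@code nowMillis} should be logged at WARN. */
    boolean shouldWarn(long nowMillis) {
        return nowMillis - startMillis >= graceMillis;
    }

    public static void main(String[] args) {
        StartupGraceLog log = new StartupGraceLog(System.currentTimeMillis(), 20_000);
        // Called right after start, so the failure falls inside the grace period.
        if (log.shouldWarn(System.currentTimeMillis())) {
            System.out.println("WARN Cannot open channel to 2");
        } else {
            System.out.println("DEBUG Cannot open channel to 2 (startup grace period)");
        }
    }
}
```

With graceMillis set to 0 the check always returns true, so the default logging would be unchanged. Note that this cannot tell a peer that has crashed from one that has not started yet; it only hides both for a fixed time.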
-
Re: Fast leader election initial delay, is that possible? Flavio Junqueira 2011-08-19, 11:00
Sampath,

Do you think something along the lines of what Ted describes would work for you?

-Flavio
-
Re: Fast leader election initial delay, is that possible?Vishal Kher 2011-08-19, 20:13
My few cents..
I am not sure if we can distinguish between spurious/non-spurious warnings and I don't think we can time it well. The delay is applicable only in certain cases. If the user knows that there will be a start up delay, then the user can ignore those errors or modify their scripts to start the server after a delay. Does this have to implemented in the server? I sounds me that this is something that user scripts should handle. On Fri, Aug 19, 2011 at 7:00 AM, Flavio Junqueira <[EMAIL PROTECTED]> wrote: > Sampath, Do you think something along the lines of what Ted describes would > work for you? > > -Flavio > > On Aug 18, 2011, at 7:13 PM, Ted Dunning wrote: > > The thought is that a server would not complain about connection refused or > inability to form a quorum during the first (say) twenty seconds of > operation. > > The thesis is that warnings from these causes during that time are > spurious. > > As I mentioned, I don't see this as urgent or even necessarily a good idea. > I completely reboot a ZK cluster once every year or three. When I am doing > a rolling upgrade, I *want* to see alerts when I bounce a machine. If I > don't want to see those alerts, my monitoring system allows me to put a > machine into maintenance mode for a short period of time to temporarily > suppress the warnings. > > All I was doing was translating and elaborating the original poster's > suggestion, not so much endorsing it. > > On Thu, Aug 18, 2011 at 8:54 AM, Flavio Junqueira <[EMAIL PROTECTED]>wrote: > >> Hi Ted, I don't see how one can automate the distinction between a machine >> that is down because it crashed and a machine that is down because it hasn't >> started yet. Assuming that we are logging the machine unavailability as we >> are doing currently, one can always look at the timestamp of the warning and >> remember that this is the time the machines were bootstrapping. 
>> Consequently, I don't really see the point of reducing the number of >> warnings, unless the warnings are really polluting the logs. I typically >> don't see so many that prevents me from reading the rest, but you may have a >> different perception. Also, recall that we back off, so the warnings become >> less frequent over time. >> >> I'm open to ideas, though. If you see anything wrong in my rationale or if >> you have an idea of how to do it differently, then I'd be happy to hear. >> However, if the idea is simply to add a parameter that configures the time >> for leader election to start, then I'm currently not in favor. >> >> -Flavio >> >> On Aug 18, 2011, at 5:39 PM, Ted Dunning wrote: >> >> Flavio, >> >> What you say is correct, but the original poster does have a point that >> many >> of these warnings are to be expected and there is a heuristic that might >> assist in distinguishing some of these cases so that false alarms in the >> logs could be decreased. >> >> That doesn't seem like a big deal to me, but different people have >> different >> itches. In my experience, restarting a ZK cluster from zero almost never >> happens. >> >> On Thu, Aug 18, 2011 at 8:36 AM, Ted Dunning <[EMAIL PROTECTED]> >> wrote: >> >> >> >> On Thu, Aug 18, 2011 at 12:15 AM, Sampath Perera <[EMAIL PROTECTED] >> >wrote: >> >> >> >> Hhmmm, I think this is a bit different isn't it? Here we know that the >> >> first >> >> server to come will be failing to connect to the other as they are not yet >> >> up. Anyway our real issue is the warning. >> >> >> >> We know that. >> >> >> But how does the server know that it is the first server? That is the >> >> whole point of the leader election. You might just have a server >> rejoining >> >> a cluster. Or you might have a cluster that has been turned off. Or a >> >> cluster with 2 out of 5 machines off and we tried to touch the other down >> >> machine before the others. >> >> >> >> >> Would you like to suggest a patch? 
>> >>
>> >> Of course I do.. will prepare a patch and attach.
>> >
>> > Great!
-
Re: Fast leader election initial delay, is that possible?Sampath Perera 2011-08-20, 02:23
Yeah, that will work for me.
Also, it is just going to be a configuration option, and the overhead will apply only when this error occurs, since it is just an if statement before printing out the error. The default behavior will not change, and I do not expect any overhead in the default case.

OTOH, I am OK to leave it as it is and let our customer know that this is how it works :-) Actually, my original intention was to find out whether any such configuration exists, as I was unable to find it in the docs. So, if the majority of devs are not in favour of this change, I will not do it.

Thanks for all your feedback!

Sampath

On Fri, Aug 19, 2011 at 4:30 PM, Flavio Junqueira <[EMAIL PROTECTED]> wrote:

> Sampath, Do you think something along the lines of what Ted describes would work for you?
>
> -Flavio

Thanks,
Sampath
http://adroitlogic.org
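For readers following the thread, the "if statement before printing out the error" could look roughly like the sketch below. This is only an illustration under stated assumptions: the class name, the grace-period setting, and the choice to downgrade the message to INFO rather than drop it are all hypothetical, and nothing here is an existing ZooKeeper class or configuration option. It uses `java.util.logging` purely to stay self-contained. Note that it does not delay leader election at all; it only changes how the failed connection attempt is reported.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/**
 * Hypothetical helper, NOT part of ZooKeeper: picks a quieter log level for
 * "cannot open channel" messages during a configurable startup grace period.
 */
final class StartupGraceLogLevel {
    private static final Logger LOG = Logger.getLogger("QuorumCnxManagerSketch");

    private final long startMillis;
    // Assumed setting; a value of 0 keeps the current behaviour (always WARNING).
    private final long graceMillis;

    StartupGraceLogLevel(long startMillis, long graceMillis) {
        this.startMillis = startMillis;
        this.graceMillis = graceMillis;
    }

    /** True while the server has been up for less than the grace period. */
    boolean inGracePeriod(long nowMillis) {
        return nowMillis - startMillis < graceMillis;
    }

    /** INFO inside the grace period, WARNING once it has elapsed. */
    Level levelFor(long nowMillis) {
        return inGracePeriod(nowMillis) ? Level.INFO : Level.WARNING;
    }

    /** The single "if" from the discussion, applied just before logging. */
    void logCannotOpenChannel(long nowMillis, long serverId, String address) {
        LOG.log(levelFor(nowMillis),
                "Cannot open channel to {0} at election address {1}",
                new Object[] {serverId, address});
    }
}
```

With a twenty-second grace period (the figure Ted mentions), a refused connection five seconds after startup would be logged at INFO, and the same failure a minute later at WARNING. As Flavio and Ted point out, this cannot tell a peer that has not started yet from one that has crashed, so a rolling restart would also lose its early warnings unless the grace period is left at 0.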
-
Re: Fast leader election initial delay, is that possible?Sampath Perera 2011-08-20, 02:30
Hi Vishal,
On Sat, Aug 20, 2011 at 1:43 AM, Vishal Kher <[EMAIL PROTECTED]> wrote:

> My few cents..
> I am not sure if we can distinguish between spurious/non-spurious warnings and I don't think we can time it well. The delay is applicable only in certain cases. If the user knows that there will be a start up delay, then the user can ignore those errors or modify their scripts to start the server after a delay.

I guess you misinterpreted it :-( Starting the server after a delay is not a solution for the original problem that I was referring to. I also do not see how my original problem could be fixed through a script. At least I do not know how to do it. Maybe by changing the log level to something like FATAL and reverting it to INFO after the delay? I do not think that is a good idea, as it would cut off some of the output that I want to see.

> Does this have to be implemented in the server? It sounds to me that this is something that user scripts should handle.

As I said, I do not see how a user script can handle this. If there is any option, please do let me know.

Sampath

Thanks,
Sampath
http://adroitlogic.org