Ted Dunning
2012-03-15, 18:50
Vitalii Tymchyshyn
2012-03-15, 20:29
Ted Dunning
2012-03-16, 05:41
Christian Ziech
2012-03-14, 16:04
Christian Ziech
2012-03-14, 17:01
Alexander Shraer
2012-03-15, 06:46
Christian Ziech
2012-03-15, 09:50
Alexander Shraer
2012-03-15, 15:33
Alexander Shraer
2012-03-15, 22:54
Alexander Shraer
2012-03-16, 03:43
Christian Ziech
2012-03-16, 09:56
Ted Dunning
2012-03-16, 15:51
Alexander Shraer
2012-03-16, 18:37
Christian Ziech
2012-03-19, 12:11
Benjamin Reed
2012-03-16, 18:15
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Ted Dunning 2012-03-15, 18:50

Alexander's comment still applies.

VMs can function or go away completely, but they can also malfunction in more subtle ways such that they just go VEEEERRRRY slowly. You have to account for that failure mode. These failures can even be transient.

This would probably break your approach.

On 3/15/12, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> Oh sorry, there is a slight misunderstanding. With VM I did not mean the java
> vm but the Linux vm that contains the zookeeper node. We get notified if
> that goes away and is repurposed.
>
> BR
> Christian
>
> Sent from my Nokia Lumia 800
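The failure mode described in this message can be illustrated with a minimal sketch of a timeout-based failure detector. This is not ZooKeeper code: the class TimeoutDetectorSketch, the method suspected and all timing values are hypothetical and chosen only for illustration. The point is that a crashed VM and a VM that is merely slow look identical to such a detector until the delayed heartbeat arrives.

```java
public final class TimeoutDetectorSketch {

    /** Returns true when no heartbeat has been seen for longer than timeoutMs. */
    static boolean suspected(long lastHeartbeatMs, long nowMs, long timeoutMs) {
        return nowMs - lastHeartbeatMs > timeoutMs;
    }

    public static void main(String[] args) {
        long timeoutMs = 2_000; // hypothetical timeout

        // Both VMs sent their last heartbeat at t = 1000 ms.
        long lastBeatFromCrashedVm = 1_000; // this VM is really gone
        long lastBeatFromSlowVm = 1_000;    // this VM is alive, its next heartbeat is only delayed

        // At t = 5000 ms the detector cannot tell the two cases apart.
        System.out.println(suspected(lastBeatFromCrashedVm, 5_000, timeoutMs)); // true
        System.out.println(suspected(lastBeatFromSlowVm, 5_000, timeoutMs));    // true

        // The slow VM's heartbeat finally arrives at t = 6000 ms, after it was already declared dead.
        System.out.println(suspected(6_000, 6_500, timeoutMs));                 // false
    }
}
```

Any scheme that replaces a server as soon as it is suspected has to cope with the third case, where the "dead" server turns out to be alive.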
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Vitalii Tymchyshyn 2012-03-15, 20:29

In my view this can work in a given environment (where it is externally ensured that the old server is down until the new one is up), but it can't be introduced as a generic zookeeper patch, simply because without such an external guarantee, two servers with different IPs and the same name can be recognized by different parts of the ensemble.

BTW: Is such a problem already possible today with dynamic DNS and servers started at different DNS states?

I think a better option would be to assign IPs to your VMs dynamically. E.g. when you decide a given server should become serverA, you simply add IPa as an alias IP for it. ARP will ensure you won't have two As, and you don't need to touch the code.

Best regards,
Vitalii Tymchyshyn
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Ted Dunning 2012-03-16, 05:41

The problem is that the old server could come back after you have decided it was gone. You now have an inconsistent cluster where the old server is saying things that the other servers are not properly responding to.

On Thu, Mar 15, 2012 at 8:35 PM, <[EMAIL PROTECTED]> wrote:
> I agree with your points about any kind of VMs having a hard to predict
> runtime behaviour and that participants of the zookeeper ensemble running
> on a VM could fail to send keep-alives for an uncertain amount of time. But
> I don't yet understand how that would break the approach I was mentioning:
> Just trying to re-resolve the InetAddress after an IO exception should in
> that case still lead to the same original IP address (and eventually to
> that node rejoining the ensemble).
> Only if that host name (the old node was using) would be re-assigned to
> another instance this step of re-resolving would point to a new IP (and
> hence cause the old server to be replaced).
>
> Did I understand your objection correctly?
-
Zookeeper on short lived VMs and ZOOKEEPER-107
Christian Ziech 2012-03-14, 16:04

Let me describe our upcoming use case in a few words: We are planning to use zookeeper in a cloud where nodes typically come and go unpredictably. We could ensure that we always have a more or less fixed quorum of zookeeper servers with a fixed set of host names. However, the IPs associated with the host names would change every time a new server comes up. I browsed the code a little and it seems that right now the only problem is that the zookeeper server remembers the resolved InetSocketAddress in its QuorumPeer hash map.

I saw that ZOOKEEPER-107 would possibly also solve that problem, but in a more generic way than we actually need (perhaps I underestimate here the impact of joining as a server with an empty data directory to replace a server that previously had one).

Given that, from looking at ZOOKEEPER-107, it seems that it will still take some time for the proposed fix to make it into a release, would it make sense to invest time into a smaller fix just for this "replacing a dropped server without rolling restarts" use case? Would there be a chance that a fix for this makes it into the 3.4.x branch?

Are there perhaps other ways to get this use case supported without the need for rolling restarts whenever we need to replace one of the zookeeper servers?
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Christian Ziech 2012-03-14, 17:01

Just a small addition: In my opinion the patch could really boil down to adding a

    quorumServer.electionAddr = new InetSocketAddress(electionAddr.getHostName(), electionAddr.getPort());

in the catch(IOException e) clause of the connectOne() method of the QuorumCnxManager. In addition one should perhaps make the electionAddr field in the QuorumPeer.QuorumServer class volatile to prevent races.

I haven't fully checked this change for implications yet, but a quick test on some machines at least showed it would solve our use case. What do the more expert users / maintainers think - is it even worthwhile to go that route?

--
*NOKIA*
*Christian Ziech*
Senior Software Developer
Context Based Services
Services & Software
Mobile: +4915155155740
Fax: +493044676555
eMail: [EMAIL PROTECTED]
Nokia gate5 GmbH
Invalidenstr. 117
10115 Berlin, Germany
www.maps.nokia.com <http://www.maps.nokia.com>
www.smart2go.com <http://www.smart2go.com>

Nokia gate5 GmbH, registered office: Berlin, district court Charlottenburg: HRB 106443 B, tax no.: 37/222/20817, ID/VAT no.: DE 812 845 193, managing directors: Dr. Michael Halbherr, Karim Tähtivuori
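The change proposed in this message can be sketched in isolation as follows. This is a rough illustration, not the actual ZooKeeper patch: PeerEntry is a hypothetical stand-in for QuorumPeer.QuorumServer, onConnectFailure stands in for the catch(IOException e) branch of connectOne(), and getHostString() (Java 7+) is used instead of getHostName() so that no reverse lookup is triggered.

```java
import java.net.InetSocketAddress;

public final class ReResolveSketch {

    /** Hypothetical stand-in for QuorumPeer.QuorumServer, reduced to the one field discussed here. */
    static final class PeerEntry {
        // volatile so that a connection thread sees the address written by another thread
        volatile InetSocketAddress electionAddr;

        PeerEntry(InetSocketAddress electionAddr) {
            this.electionAddr = electionAddr;
        }
    }

    /** Builds a new InetSocketAddress from host name and port, which performs a new name lookup. */
    static InetSocketAddress reResolve(InetSocketAddress old) {
        return new InetSocketAddress(old.getHostString(), old.getPort());
    }

    /** Would be called where the connection attempt failed with an IOException. */
    static void onConnectFailure(PeerEntry peer) {
        peer.electionAddr = reResolve(peer.electionAddr);
    }

    public static void main(String[] args) {
        // An IP literal keeps the example independent of DNS; 3888 is just an example port.
        PeerEntry peer = new PeerEntry(InetSocketAddress.createUnresolved("127.0.0.1", 3888));
        onConnectFailure(peer);
        System.out.println(peer.electionAddr.isUnresolved()); // false: the entry now holds a resolved address
    }
}
```

Note that the JVM caches successful name lookups (see the networkaddress.cache.ttl security property), so a re-created InetSocketAddress may still carry the old IP for a while after the DNS record has changed.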
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Alexander Shraer 2012-03-15, 06:46

Hi Christian,

ZK-107 would indeed allow you to add/remove servers and change their addresses.

> We could ensure that we always have a more or less fixed quorum of zookeeper servers with a fixed set of host names.

You should probably also ensure that a majority of the old ensemble intersects with a majority of the new one. Otherwise you have to run a reconfiguration protocol similar to ZK-107. For example, if you have 3 servers A, B and C, and now you're adding D and E that replace B and C, how would this work? It is probable that D and E don't have the latest state (as you mention) and A is down or doesn't have the latest state either (a minority might not have the latest state). Also, how do you prevent split brain in this case, meaning B and C thinking that they are still operational? Perhaps I'm missing something, but I suspect that the change you propose won't be enough...

Best Regards,
Alex
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Christian Ziech 2012-03-15, 09:50

I don't think that we could be running into a split brain problem in our use case.
Let me try to describe the scenario we are worried about (assuming an ensemble of 5 nodes A,B,C,D,E):
- The ensemble is up and running and in sync
- Node A with the host name "zookeeperA.whatever-domain.priv" goes down because the VM has gone away
- That removal of the VM is detected and a new VM is spawned with the same host name "zookeeperA.whatever-domain.priv" - let's call that node A'
- Node A' zookeeper wants to join the cluster - right now this gets rejected by the others since A' has a different IP address than A (and the old one is "cached" in the InetSocketAddress of the QuorumPeer instance)

We could ensure that at any given time there is at most one node with host name "zookeeperA.whatever-domain.priv" known by the ensemble and that once a node is replaced, it would not come back. Also we could make sure that our ensemble is big enough to compensate for a replacement of up to x nodes at a time (setting it to x*2 + 1 nodes).

So if I did not misestimate our problem, it should be (due to the restrictions) simpler than the problem to be solved by zookeeper-107. My intention is basically to solve this smaller discrete problem so that we do not need to wait until zookeeper-107 makes it into a release (the assumption being that a smaller fix might even have a chance to make it into the 3.4.x branch).
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Alexander Shraer 2012-03-15, 15:33

Yes, by replacing x at a time from 2x+1 you have quorum intersection.

I have one more question - zookeeper itself doesn't assume perfect failure detection, which your scheme requires. What if the VM didn't actually fail but is just slow and then tries to reconnect?
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Alexander Shraer 2012-03-15, 22:54

I think the concern is that the old VM can recover and try to reconnect. Theoretically you could even go back and forth between new and old VM. For example, suppose that you have servers A, B and C in the cluster, A is the leader. C is slow and "replaced" with C', then update U is acked by A and C', then A fails. In this situation you cannot have additional failures. But with the automatic replacement thing it can (theoretically) happen that C' becomes a little slow, C connects to B and is chosen as leader, and the committed update U is lost forever. This is perhaps unlikely but possible...

Alex
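The scenario in this message can be checked mechanically with a small sketch. It is only a model under stated assumptions: machines are plain strings, C and C' are both assumed to be accepted under the same configured server id (3), and the names SplitQuorumSketch, isQuorum and shareMachine are hypothetical. It shows that {A, C'} and {B, C} each count as a majority of the three configured ids while sharing no physical machine, which is why update U can be lost.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public final class SplitQuorumSketch {

    // Physical machine -> configured server id. C and C' both claim id 3 (assumption of the scenario).
    static final Map<String, Integer> SERVER_ID = Map.of("A", 1, "B", 2, "C", 3, "C'", 3);
    static final int ENSEMBLE_SIZE = 3;

    /** A set of machines forms a quorum if it covers a strict majority of the configured ids. */
    static boolean isQuorum(Set<String> machines) {
        Set<Integer> ids = new HashSet<>();
        for (String machine : machines) {
            ids.add(SERVER_ID.get(machine));
        }
        return ids.size() > ENSEMBLE_SIZE / 2;
    }

    /** True if at least one physical machine is a member of both sets. */
    static boolean shareMachine(Set<String> first, Set<String> second) {
        for (String machine : first) {
            if (second.contains(machine)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Set<String> ackedUpdateU = Set.of("A", "C'"); // quorum that committed update U
        Set<String> laterQuorum = Set.of("B", "C");   // quorum formed after A fails and C comes back

        // Both are majorities by id, yet no machine that stored U is in the later quorum.
        System.out.println(isQuorum(ackedUpdateU) + " " + isQuorum(laterQuorum) + " "
                + shareMachine(ackedUpdateU, laterQuorum)); // prints "true true false"
    }
}
```

With a fixed membership any two majorities of three servers share a machine; the overlap disappears here only because two different machines are allowed to answer for the same id.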
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Alexander Shraer 2012-03-16, 03:43

Actually it's still not clear to me how you would enforce the 2x+1. In Zookeeper we can guarantee liveness (progress) only when x+1 are connected and up, however safety (correctness) is always guaranteed, even if 2 out of 3 servers are temporarily down. Your design needs the 2x+1 for safety, which I think is problematic unless you can accurately detect failures (synchrony) and failures are permanent.

Alex
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107 Christian Ziech 2012-03-16, 09:56
Under normal circumstances we should be able to detect failures correctly. The scenario I have in mind is that one ZooKeeper system is taken down for a reason and then possibly just rebooted or even started from scratch elsewhere. In both cases, however, the new host would have the old DNS name but most likely a different IP. But of course that only applies to us and possibly not to all users.

When thinking about the scenario you described I understood where the problem lies. However, wouldn't the same problem also be relevant to the way ZooKeeper is implemented right now? Let me try to explain why (possibly I'm wrong here, since I may be missing some points on how ZooKeeper servers work internally; corrections are very welcome):
- Same setup as you described: node A with host name a, B with host name b and C with host name c
- Also as in your scenario, C is falsely detected as down due to some human error. Hence C' is brought up and is assigned the same DNS name as C
- Now rolling restarts are performed to bring in C'
- A resolves c correctly to the new IP and connects to C', but B still resolves the host name c to the original address of C and hence does not connect (I think some DNS slowness is also required for your approach in order for the host name c to be resolved to the original IP of C)
- Now the rest of your scenario happens: update U is applied, C' gets slow, C recovers and A fails.
Of course this approach also requires some DNS craziness, but if I did not make a mistake in my reasoning it should still be possible.

PS: Wouldn't your scenario also invalidate the solution of the HBase guys, who use Amazon's elastic IPs to solve the same problem (see https://issues.apache.org/jira/browse/HBASE-2327)?
PS2: If the approach I had in mind is not valid, do you guys already have a plan for when 3.5.0 would be released, or could you be supported in some way so that ZOOKEEPER-107 makes it into a release sooner?
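The lost-update scenario discussed here (update U acked by A and C', then B and the recovered C electing a leader) can be sketched as a toy quorum check. This is only an illustration under stated assumptions: `is_quorum`, the server labels and the id mappings are hypothetical names, not ZooKeeper code, and case 2 assumes that B has already learned the new membership.

```python
def is_quorum(servers, server_id, ensemble_ids):
    """True if the ids of these physical servers form a strict majority."""
    ids = {server_id[s] for s in servers} & ensemble_ids
    return len(ids) > len(ensemble_ids) // 2

write_quorum = {"A", "C'"}    # servers that acked update U
election_quorum = {"B", "C"}  # servers that elect a leader after A fails

# Case 1: C' reuses C's identity, so both count as server "c".
shared_id = {"A": "a", "B": "b", "C": "c", "C'": "c"}
ensemble = {"a", "b", "c"}
both_quorums = (is_quorum(write_quorum, shared_id, ensemble)
                and is_quorum(election_quorum, shared_id, ensemble))
u_can_be_lost = both_quorums and write_quorum.isdisjoint(election_quorum)

# Case 2: C is removed and C' joins under a new id "d" (assumed known to B).
distinct_id = {"A": "a", "B": "b", "C": "c", "C'": "d"}
new_ensemble = {"a", "b", "d"}
old_c_still_counts = is_quorum(election_quorum, distinct_id, new_ensemble)

print("shared id, U can be lost:", u_can_be_lost)
print("distinct id, B and old C form a quorum:", old_c_still_counts)
```

With a shared identity both sets count as majorities although no physical server is in both, so nothing forces the new leader to have seen U. With distinct ids the old C no longer counts toward a majority of the new membership.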
Christian Ziech 2012-03-16, 09:56
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107 Ted Dunning 2012-03-16, 15:51
On Fri, Mar 16, 2012 at 9:56 AM, Christian Ziech <[EMAIL PROTECTED]> wrote:

> Under normal circumstances the ability to detect failures correctly should
> be given. The scenario I'm aware of includes one zookeeper system would be
> taken down for a reason and then possibly just rebooted or even started
> from scratch elsewhere. In both cases however the new host would have the
> old dns name but most likely a different IP. But of course that only
> applies to us and possibly not to all of the users.

This is a bizarre way to start a post on HA considerations.

Detecting failures is always subject to errors. You can make the detection process less broken, but there is a core uncertainty that is inherent in the problem. You can bias the detection process toward false positives or false negatives, but you can't completely get rid of either kind of error without substantially increasing the total number of errors.

Most people bias strongly toward false negatives (system is marked as up, but is down), if only because the system impact of false positives can be quite high and because the cost of pushing toward faster detection of failures can also be very high (consider what it would mean to have pings every 100ms... the server under test would have to be re-designed from the ground up with hard real-time principles in mind).

Given this context, all HA designs have to account for erroneous marking of systems. This is closely related to the CAP theorem. There, the whole point is that you can't really distinguish between the alternatives that you are cut off from the system in question and that it is down. In practice, your uncertainty is even worse than that.

So you really have to design around a statement that the failure detection system will have a (1-epsilon_1) probability of being correct when it marks systems as down and that it will have a (1-epsilon_2) probability of detecting failures within t_1 seconds. Furthermore, the probability of detecting failures should smoothly transition to (1-epsilon_3) within t_2 seconds.

For heartbeat-based systems where n heartbeats must be lost, epsilon_1 is pretty small but distinctly non-zero, epsilon_2 and t_1 are 1 and n-1 ticks respectively, epsilon_3 is on the close order of epsilon_1, and t_2 is somewhere near n ticks. This implies that you cannot detect failures in less than a certain amount of time and that you will still miss some failures. When I am designing, I try to avoid assuming that epsilon_1 and epsilon_3 are less than about 0.1%.

If you factor a model like this into your design, you inherently no longer make statements like "failure is impossible". Instead, you say failure has probability < p of occurring in t seconds. If you persist in the former, you will be very wrong much of the time and will be unable to optimize the correct function of your system, or to recognize what is happening when it does fail (as it will).

> - Same scenarios as you described - nodes A with host name a, B host name b
> and C with host name c
> - Also same as in your scenario C is due to some human error falsely
> detected as down. Hence C' is brought up and is assigned the same DNS name
> as C
> - Now rolling restarts are performed to bring in C'
> - A resolves c correctly to the new IP and connects to C' but B still
> resolves the host name c to the original address of C and hence does not
> connect (I think some DNS slowness is also required for your approach in
> order for the host name c being resolved to the original IP of C)

This is hardly surprising given DNS timeouts and caching. Consider what would happen if B has C cached and is partitioned away from the DNS server.

> - now the rest of your scenario happens: Update U is applied, C' gets
> slow, C recovers and A fails.
> Of course also this approach requires some DNS craziness but if I did not
> make a mistake in my thoughts it should still be possible.

This isn't craziness. This is reality. And frankly, you are assuming that A and B are even being served by the same DNS server.

My experience is that DNS is messed up at an astonishing percentage of otherwise very sophisticated installations. You are assuming that DNS can handle a task (fast updates) that most systems do not assume that it can do.

Note that simply making the claim "my DNS is not messed up" is only very weak evidence to me that your DNS is broken somehow. Most of the admins of correctly operating DNS say "we have had broken DNS in the past, tell me what you need and I will check".

> PS: Wouldn't your scenario not also invalidate the solution of the hbase
> guys using amazons elastic ips to solve the same problem (see
> https://issues.apache.org/jira/browse/HBASE-2327)?

Don't think so. Those guys are replacing the IP address itself, so all traffic inherently moves to the new machine. There can be a short window of misdirection, but elastic IPs work very well. Moreover, you can firmly take down the original on EC2 and you can release the IP manually, which makes the errors almost entirely that the server in question simply cannot be reached, rather than there being uncertainty about which one is being reached.
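The heartbeat case described above can be illustrated with a small simulation. This is a minimal sketch, assuming a detector that marks a node down after n consecutive missed heartbeats; the function name `suspected_at` and the tick numbering are hypothetical and not taken from ZooKeeper.

```python
def suspected_at(heartbeat_ticks, n, horizon):
    """Return the first tick at which the node is marked down, or None.

    heartbeat_ticks: set of ticks at which a heartbeat actually arrived.
    n: consecutive missed heartbeats required before marking the node down.
    horizon: number of ticks to simulate (ticks 1..horizon).
    """
    missed = 0
    for tick in range(1, horizon + 1):
        if tick in heartbeat_ticks:
            missed = 0          # any heartbeat resets the counter
        else:
            missed += 1
            if missed >= n:
                return tick     # node is now suspected
    return None

N = 3
crashed = suspected_at({1, 2, 3}, N, 10)                  # really dead after tick 3
stalled = suspected_at({1, 2, 3, 7, 8, 9, 10}, N, 10)     # alive, but paused for 3 ticks
hiccup = suspected_at({1, 2, 3, 6, 7, 8, 9, 10}, N, 10)   # paused for only 2 ticks

print(crashed, stalled, hiccup)
```

The crashed node and the stalled node are marked down at the same tick, so the detector cannot tell them apart: the stalled node is the kind of wrong "down" marking that epsilon_1 accounts for, and the crash is not noticed until n ticks after the last heartbeat.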
Ted Dunning 2012-03-16, 15:51
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107 Alexander Shraer 2012-03-16, 18:37
I think this is why, when you're doing rolling restarts / reconfiguration, you should never have two different servers with the same id that have any chance of being up at the same time. With 107 you'd have to remove the server and add a new server with some different id (choosing the new id is left to the user).

In terms of support with 107 we need all the help we can get :) Currently there are two parts of it in pretty good shape that I'm hoping to integrate soon: 1355 and 1411. Comments or testing of 1411 would be very helpful at this point. Also, if you wish, you can check out the latest patch for 107 (that patch is not going to be integrated - instead I'm trying to get it in piece by piece, but still, you can try it and see if it works for you or if you have comments. You can also help by writing tests for it).

Best Regards,
Alex
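The remove-then-add rule with a fresh id can be sketched as a toy membership table. This is only an illustration, not the ZOOKEEPER-107 interface: the `Ensemble` class, its methods and the host names are hypothetical, and permanently retiring a removed id is an assumption made here to show the intent of never reusing an id.

```python
class Ensemble:
    """Toy membership table keyed by server id (hypothetical, not ZooKeeper's API)."""

    def __init__(self, members):
        self.members = dict(members)  # id -> host name
        self.retired = set()          # ids that were removed and must not be reused

    def remove(self, sid):
        """Remove a server and retire its id."""
        del self.members[sid]
        self.retired.add(sid)

    def add(self, sid, host):
        """Add a server; refuse ids that are in use or were used before."""
        if sid in self.members or sid in self.retired:
            raise ValueError(f"server id {sid} is already in use or retired")
        self.members[sid] = host


ens = Ensemble({1: "host-a", 2: "host-b", 3: "host-c"})
ens.remove(3)                      # the old C leaves the ensemble

try:
    ens.add(3, "host-c-new")       # reusing id 3 is rejected
    reused = True
except ValueError:
    reused = False

ens.add(4, "host-c-new")           # the replacement joins under a new id
print(sorted(ens.members), reused)
```

Because the replacement never shares an id with the old server, a recovered old server cannot be mistaken for a current member.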
Alexander Shraer 2012-03-16, 18:37
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107 Christian Ziech 2012-03-19, 12:11
I think I should be able to help here a little then. Applying the huge patch no longer worked for me on the current trunk (it did not apply cleanly). But if you describe exactly what kind of support you need (which tests, checking which code, etc.), I should be able to spend some time on that.
Christian Ziech 2012-03-19, 12:11
-
Re: Zookeeper on short lived VMs and ZOOKEEPER-107 Benjamin Reed 2012-03-16, 18:15
wrt ps2 the rule is that bug fix releases are only for bugs. so 3.4.X shouldn't have more features than 3.4.0. so this really is something for 3.5.0. it would be nice to have shorter release cycles. 3.4.0 was released in november, so we should be doing a 3.5.0 release some time relatively soon.

ben
Benjamin Reed 2012-03-16, 18:15
|