|
|
-
question on ZAB protocol
Yang 2011-07-13, 01:45
I read the ZAB paper before, and never realized this question, but find out today that I can't answer why, so I'm bringing it up here.
according to the paper
B. Reed and F. P. Junqueira. A simple totally ordered broadcast protocol. In LADIS ’08: Proceedings of the 2nd Workshop on Large-Scale Distributed Systems and Middleware, pages 1–6, New York, NY, USA, 2008. ACM.
the leader broadcasts a write to all replicas, and then waits for a quorum to reply, before sending out the COMMIT. why is the quorum necessary (i.e. why can't the leader just wait for one reply and start sending the COMMIT?)??
now that I think about it, it seems that waiting for just one reply is enough, because the connection from leader to replicas are FIFO, as long as the replicas do not die, they will eventually get the writes, even though the writes arrive at them after the leader starts the COMMIT.
the only reason I can think of for using a quorum is to tolerate more failures: if the only replied replica dies, and leader dies, then we lose that latest write. by requiring f ACKs, you can tolerate f-1 failures. but then you don't really need 2f+1 nodes in the ZK cluster, just f+1 is enough. Thanks a lot Yang
-
Re: question on ZAB protocol
Yang 2011-07-13, 01:46
btw, to give proper credit, I thought about this question after reading http://www.vldb.org/pvldb/vol4/p243-rao.pdfwhich actually just waits for 1 reply On Tue, Jul 12, 2011 at 6:45 PM, Yang <[EMAIL PROTECTED]> wrote: > I read the ZAB paper before, and never realized this question, but > find out today that I can't answer why, so I'm bringing it up here. > > according to the paper > > B. Reed and F. P. Junqueira. A simple totally ordered broadcast > protocol. In LADIS ’08: Proceedings of the 2nd Workshop > on Large-Scale Distributed Systems and Middleware, pages 1–6, > New York, NY, USA, 2008. ACM. > > > > the leader broadcasts a write to all replicas, and then waits for a > quorum to reply, before sending out the COMMIT. > why is the quorum necessary (i.e. why can't the leader just wait for > one reply and start sending the COMMIT?)?? > > now that I think about it, it seems that waiting for just one reply is > enough, because the connection from leader to replicas are FIFO, as > long as the replicas do not die, > they will eventually get the writes, even though the writes arrive at > them after the leader starts the COMMIT. > > the only reason I can think of for using a quorum is to tolerate more > failures: if the only replied replica dies, and leader dies, then we > lose that latest write. > by requiring f ACKs, you can tolerate f-1 failures. but then you don't > really need 2f+1 nodes in the ZK cluster, just f+1 is enough. > > > Thanks a lot > Yang >
-
Re: question on ZAB protocol
Yang 2011-07-13, 02:59
I guess by asking what happens in the following scenarios, I'd understand it better:
let's say we have nodes A B C D E, now A is the leader
A broadcasts <1,1>, it reaches B, then A, B die, C D E elect someone, the new system is going to throw away <1,1> since it does not know its existence, right?
start from scratch, A broadcasts<1,1> , it reaches all, all send ACK to A, but A dies before receiving the ACK, then BCDE elects someone, and the new leader sees <1,1> in log, so it broadcasts <1,1> to BCDE, which all commit it. now if we look back, when A dies, the client should get a "write failure", but now after BCDE relection, the written value does get into the system ???
Thanks Yang I kind of see another possible reason for the quorum:
when the new leader is elected, it needs to find out the longest history of all followers, and bring everybody up to date to that longest history. if every write reaches a quorum, then a quorum during election is guaranteed to give the new leader the full history.
but against the above argument, if old leader gets quorum, broadcasts a COMMIT to all, but COMMIT reaches only 1 of them, and then both then the Quorum dies, and the follower that has the COMMIT just served the new data to client, then it dies too, now the election process is going find a history missing the last written record. so some written records are going to be lost anyway,
On Tue, Jul 12, 2011 at 6:45 PM, Yang <[EMAIL PROTECTED]> wrote: > I read the ZAB paper before, and never realized this question, but > find out today that I can't answer why, so I'm bringing it up here. > > according to the paper > > B. Reed and F. P. Junqueira. A simple totally ordered broadcast > protocol. In LADIS ’08: Proceedings of the 2nd Workshop > on Large-Scale Distributed Systems and Middleware, pages 1–6, > New York, NY, USA, 2008. ACM. > > > > the leader broadcasts a write to all replicas, and then waits for a > quorum to reply, before sending out the COMMIT. > why is the quorum necessary (i.e. why can't the leader just wait for > one reply and start sending the COMMIT?)?? > > now that I think about it, it seems that waiting for just one reply is > enough, because the connection from leader to replicas are FIFO, as > long as the replicas do not die, > they will eventually get the writes, even though the writes arrive at > them after the leader starts the COMMIT. > > the only reason I can think of for using a quorum is to tolerate more > failures: if the only replied replica dies, and leader dies, then we > lose that latest write. > by requiring f ACKs, you can tolerate f-1 failures. but then you don't > really need 2f+1 nodes in the ZK cluster, just f+1 is enough. > > > Thanks a lot > Yang >
-
Re: question on ZAB protocol
Flavio Junqueira 2011-07-13, 12:50
Hi Yang, Some answers between lines:
On Jul 13, 2011, at 3:45 AM, Yang wrote:
> the leader broadcasts a write to all replicas, and then waits for a > quorum to reply, before sending out the COMMIT. > why is the quorum necessary (i.e. why can't the leader just wait for > one reply and start sending the COMMIT?)?? > > now that I think about it, it seems that waiting for just one reply is > enough, because the connection from leader to replicas are FIFO, as > long as the replicas do not die, > they will eventually get the writes, even though the writes arrive at > them after the leader starts the COMMIT. >
Just to clarify, FIFO guarantees that a prefix of what has been sent is received and that the messages follow the send order. Delivery is not guaranteed, though. If we deliver no message, FIFO is still preserved.
> the only reason I can think of for using a quorum is to tolerate more > failures: if the only replied replica dies, and leader dies, then we > lose that latest write.
Committing a transaction based on a single acknowledgement does not guarantee the durability of the operation in general if you want to tolerate at least one crash. Once a client receives a confirmation of an operation, enough ZooKeeper replicas must have persisted the corresponding update. Enough in this case implies that any subset of replicas used to recover upon a leader crash must have at least one server containing that update.
Now, I think that your optimization works only in the special case that the client that issued the operation is connected to a follower and we have 3 replicas, assuming majority quorums. If the client is connected to the leader directly, then it doesn't work. It also doesn't work if we need to tolerate at least 2 crashes, independent of the status of the replica the client is connected to.
> by requiring f ACKs, you can tolerate f-1 failures. but then you don't > really need 2f+1 nodes in the ZK cluster, just f+1 is enough.
This derivation might be obvious, but let me put it down for completeness at least. Say that we use at least q servers to recover upon a leader crash. To ensure progress, we need at least "n >= q + f" replicas, where f is a threshold on the number of simultaneous crashes you're willing to tolerate. If "q <= f", then we can't guarantee that transactions committed during a recovery step are durable, since another subsequent recovery step might not have any replica in common with the previous one. Consequently, we have that q is at least f+1, which gives us "n >= 2f + 1"
Does it make sense?
-Flavio
+
Flavio Junqueira 2011-07-13, 12:50
|
|