Hi Yang, Some answers between lines:
On Jul 13, 2011, at 3:45 AM, Yang wrote:
> the leader broadcasts a write to all replicas, and then waits for a
> quorum to reply, before sending out the COMMIT.
> why is the quorum necessary (i.e. why can't the leader just wait for
> one reply and start sending the COMMIT?)??
> now that I think about it, it seems that waiting for just one reply is
> enough, because the connection from leader to replicas are FIFO, as
> long as the replicas do not die,
> they will eventually get the writes, even though the writes arrive at
> them after the leader starts the COMMIT.
Just to clarify, FIFO guarantees that a prefix of what has been sent
is received and that the messages follow the send order. Delivery is
not guaranteed, though. If we deliver no message, FIFO is still
> the only reason I can think of for using a quorum is to tolerate more
> failures: if the only replied replica dies, and leader dies, then we
> lose that latest write.
Committing a transaction based on a single acknowledgement does not
guarantee the durability of the operation in general if you want to
tolerate at least one crash. Once a client receives a confirmation of
an operation, enough ZooKeeper replicas must have persisted the
corresponding update. Enough in this case implies that any subset of
replicas used to recover upon a leader crash must have at least one
server containing that update.
Now, I think that your optimization works only in the special case
that the client that issued the operation is connected to a follower
and we have 3 replicas, assuming majority quorums. If the client is
connected to the leader directly, then it doesn't work. It also
doesn't work if we need to tolerate at least 2 crashes, independent of
the status of the replica the client is connected to.
> by requiring f ACKs, you can tolerate f-1 failures. but then you don't
> really need 2f+1 nodes in the ZK cluster, just f+1 is enough.
This derivation might be obvious, but let me put it down for
completeness at least. Say that we use at least q servers to recover
upon a leader crash. To ensure progress, we need at least "n >= q + f"
replicas, where f is a threshold on the number of simultaneous crashes
you're willing to tolerate. If "q <= f", then we can't guarantee that
transactions committed during a recovery step are durable, since
another subsequent recovery step might not have any replica in common
with the previous one. Consequently, we have that q is at least f+1,
which gives us "n >= 2f + 1"
Does it make sense?