Re: question on ZAB protocol
Hi Yang, some answers between the lines:

On Jul 13, 2011, at 3:45 AM, Yang wrote:

> the leader broadcasts a write to all replicas, and then waits for a
> quorum to reply before sending out the COMMIT.
> why is the quorum necessary? (i.e., why can't the leader just wait
> for one reply and then start sending the COMMIT?)
>
> now that I think about it, it seems that waiting for just one reply
> is enough, because the connections from the leader to the replicas
> are FIFO: as long as the replicas do not die, they will eventually
> get the writes, even if the writes arrive after the leader starts
> the COMMIT.
>

Just to clarify: FIFO guarantees that a prefix of what has been sent is received, and that messages follow the send order. Delivery is not guaranteed, though; if no message is delivered, FIFO is still preserved.
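
To make the FIFO point concrete, here is a toy check in Python (an illustration only, not ZooKeeper code): a FIFO link guarantees the receiver sees some prefix of the sent sequence, possibly the empty one.

# Toy model of a FIFO channel: the receiver observes a prefix of the
# sent sequence, in order, but delivery of any particular message is
# not guaranteed. (Illustration only, not ZooKeeper code.)
def fifo_consistent(sent, received):
    """True iff 'received' is a prefix of 'sent'."""
    return received == sent[:len(received)]

sent = ["txn1", "txn2", "COMMIT"]
print(fifo_consistent(sent, ["txn1", "txn2", "COMMIT"]))  # True: everything delivered
print(fifo_consistent(sent, ["txn1"]))                    # True: only a prefix delivered
print(fifo_consistent(sent, []))                          # True: nothing delivered, still FIFO
print(fifo_consistent(sent, ["txn2"]))                    # False: a gap violates FIFO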

> the only reason I can think of for using a quorum is to tolerate
> more failures: if the only replica that replied dies, and the leader
> dies, then we lose that latest write.

Committing a transaction based on a single acknowledgement does not, in general, guarantee the durability of the operation if you want to tolerate at least one crash. Once a client receives confirmation of an operation, enough ZooKeeper replicas must have persisted the corresponding update. "Enough" here means that any subset of replicas used to recover after a leader crash must contain at least one server holding that update.
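
A minimal crash scenario, sketched in Python, may help; the server names and the commit-after-one-ack rule are hypothetical, purely to make the argument concrete. With 5 replicas, committing after a single follower ack lets a confirmed write vanish when that follower and the leader both crash:

# Sketch: why one ack is not enough with n = 5 replicas.
# Names and the commit rule are illustrative, not ZooKeeper's API.
n = 5
servers = {"s1", "s2", "s3", "s4", "s5"}
leader = "s1"

# The leader logs the update, gets a single ack from s2, sends COMMIT,
# and the client is told the write succeeded.
have_update = {leader, "s2"}

# The leader and the one follower that acked both crash.
crashed = {"s1", "s2"}
survivors = servers - crashed            # {"s3", "s4", "s5"}

# The survivors are a majority (3 of 5), so recovery proceeds and a
# new leader is elected, but none of them holds the committed update.
print(len(survivors) > n // 2)           # True: recovery goes ahead
print(survivors & have_update)           # set(): the confirmed write is lost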

Now, I think your optimization works only in the special case where the client that issued the operation is connected to a follower and we have 3 replicas, assuming majority quorums. If the client is connected directly to the leader, it doesn't work. It also doesn't work if we need to tolerate at least 2 crashes, regardless of which replica the client is connected to.
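
One rough way to see the 3-replica special case, counting the leader's own persisted copy as an implicit ack (an assumption of this sketch, not something the protocol spells out this way): with majority quorums, the leader plus one follower ack already form a quorum when n = 3, but not when n = 5.

# Rough check: is "leader's own copy + one follower ack" already a
# majority quorum? (Counting the leader as an implicit ack is an
# assumption of this sketch.)
def one_follower_ack_suffices(n):
    acks = 2                      # the leader itself plus a single follower
    return acks >= n // 2 + 1     # size of a majority quorum

print(one_follower_ack_suffices(3))  # True:  2 >= 2, the shortcut coincides with a quorum
print(one_follower_ack_suffices(5))  # False: 2 <  3, cannot tolerate f = 2 crashes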

> by requiring f ACKs, you can tolerate f-1 failures. but then you
> don't really need 2f+1 nodes in the ZK cluster; just f+1 is enough.

This derivation might be obvious, but let me put it down for completeness at least. Say that we use at least q servers to recover upon a leader crash. To ensure progress, we need at least "n >= q + f" replicas, where f is a threshold on the number of simultaneous crashes you're willing to tolerate. If "q <= f", then "n >= q + f" allows "n >= 2q", so two recovery quorums can be disjoint: we can't guarantee that transactions committed during one recovery step are durable, since a subsequent recovery step might not have any replica in common with the previous one. Consequently, q must be at least f+1, which gives us "n >= 2f + 1".
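
The counting argument can also be brute-forced in a few lines of Python: two quorums of size q drawn from n servers are guaranteed to overlap exactly when 2q > n, which is what q = f+1 out of n = 2f+1 achieves and q <= f does not.

# Brute-force the bound: do every two recovery quorums of size q,
# drawn from n servers, share at least one server?
from itertools import combinations

def quorums_always_intersect(n, q):
    servers = range(n)
    return all(set(a) & set(b)
               for a in combinations(servers, q)
               for b in combinations(servers, q))

# q = f + 1 out of n = 2f + 1: any two recovery quorums overlap.
print(quorums_always_intersect(n=3, q=2))  # True  (f = 1)
print(quorums_always_intersect(n=5, q=3))  # True  (f = 2)

# q <= f with n >= q + f: two recovery steps may share no replica.
print(quorums_always_intersect(n=2, q=1))  # False (f = 1, q = 1)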

Does it make sense?

-Flavio