Zookeeper, mail # user - Problems about Zab protocol

Re: Problems about Zab protocol
Flavio Junqueira 2011-04-25, 07:52
Daidong, there are several key differences between distributed
transactions and the replication problem we solve in ZooKeeper. If
you are interested in understanding them, you might start by having a
look at the Paxos Commit work of Gray and Lamport. They have a TR
available online; just use your favorite search engine.
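
The majority-ack rule that Alex describes in the quoted message below can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ZooKeeper's code: the `Proposal` class, the `majority` helper, the server names, and the choice to count the leader's own ack toward the quorum are all hypothetical.

```python
def majority(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


class Proposal:
    """Tracks acks for one proposal broadcast by the leader (hypothetical helper)."""

    def __init__(self, zxid: int, ensemble_size: int) -> None:
        self.zxid = zxid                      # identifier of the proposal
        self.quorum = majority(ensemble_size)
        self.acks = set()                     # ids of servers that have acked
        self.committed = False

    def ack(self, server_id: str) -> bool:
        """Record an ack; return True once a majority has acknowledged."""
        self.acks.add(server_id)              # a set ignores duplicate acks
        if len(self.acks) >= self.quorum:
            self.committed = True             # no need to wait for slow servers
        return self.committed


# Five servers. Assumption: the leader's own ack counts toward the quorum,
# so two follower acks are enough to commit.
p = Proposal(zxid=1, ensemble_size=5)
p.ack("leader")
p.ack("f1")
committed_after_two = p.committed
p.ack("f2")
committed_after_three = p.committed
```

With five servers the proposal commits on the third ack, so two crashed or slow followers cannot block the leader, which is the difference from a 2PC coordinator that waits for every participant.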


On Apr 23, 2011, at 6:55 AM, daidong wrote:

> Hi, Alex
> Thanks for your reply and Flavio's.
> I think I finally get the idea. :)
> Would it be appropriate to see ZAB as a 3PC without the READY/WAIT
> state, since all the participants will reply VOTE_COMMIT (they
> do not abort...)?
> I will read the source code and hope to do some work with ZAB.
> Thanks a lot for all the replies.
> --
> daidong
> On Friday, April 22, 2011 at 3:54 AM, Alexander Shraer [via
> zookeeper-user] wrote:
>> Hi Daidong,
>> In addition to Flavio's response, I'll try to address some of your  
>> specific questions.
>>> In my opinion, an atomic broadcast protocol must guarantee all the
>>> non-faulty servers have the same status eventually. So in the 2PC
>>> protocol, the coordinator must block until "all" the servers reply "ok".
>> Designed this way, the protocol wouldn't be able to tolerate any
>> failures - the leader could block waiting for a response from a
>> server that had crashed. The idea is to receive enough "ok" messages
>> to guarantee that even if a minority of servers crash, the
>> information is still not lost. That's why the leader waits for a
>> majority of acks. Messages are still sent to all followers, so they
>> will eventually get them (or if they disconnect they will later
>> reconnect and synch with the leader automatically).
>> Regarding your second question - formally, sequential consistency
>> guarantees that operations of each client take effect in the order
>> they were submitted by the client - so a client's read is
>> guaranteed to see its own last completed write.
>> In the example you mention, the client first executes a create()
>> and then getChildren(). If clients C1 and C2 both submit a create()
>> concurrently, one of these requests will reach the leader and be
>> scheduled before the other; suppose it is the create() request of C1.
>> Then, when C2 is notified about the completion of its own create(),
>> FIFO ensures that it also finds out about any operation that
>> completed before that create() (these messages were sent by the
>> leader earlier). So when C2 finally runs getChildren(), its local
>> state will already have every operation that was scheduled
>> by the leader before its own create() completed.
>> In general, ZAB implements state-machine replication by executing  
>> consensus on each operation. To understand the general idea,
>> I recommend reading Lamport's "Paxos Made Simple" paper I sent
>> earlier - it has a constructive explanation of this
>> (although the algorithm is somewhat different from ZAB).
>> Alex
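
The FIFO argument in Alex's create()/getChildren() example above can be illustrated with a toy simulation. It is a sketch under simplifying assumptions, not how ZooKeeper is implemented: the `Leader` and `Follower` classes, the sequence numbers, and the znode paths are hypothetical, and the majority-ack step is left out so that only the ordering is visible.

```python
from collections import deque


class Follower:
    """Applies operations strictly in the order the leader sent them."""

    def __init__(self) -> None:
        self.inbox = deque()   # stands in for the FIFO channel from the leader
        self.children = []     # local replica of the parent's child list

    def deliver_until(self, op_id: int) -> None:
        """Apply queued operations in order until op_id has been applied."""
        while self.inbox:
            seq, name = self.inbox.popleft()
            self.children.append(name)
            if seq == op_id:
                break


class Leader:
    """Orders every request and sends it to all followers in that order."""

    def __init__(self, followers) -> None:
        self.followers = followers
        self.next_seq = 0

    def create(self, name: str) -> int:
        self.next_seq += 1     # the leader fixes the global order here
        for follower in self.followers:
            follower.inbox.append((self.next_seq, name))
        return self.next_seq


f1, f2 = Follower(), Follower()
leader = Leader([f1, f2])
c1_op = leader.create("/parent/c1")   # C1's request is scheduled first
c2_op = leader.create("/parent/c2")   # C2's request is scheduled second

# C2 is connected to f2. By the time f2 has applied C2's create(), it has
# necessarily applied everything the leader sent before it.
f2.deliver_until(c2_op)
children_seen_by_c2 = list(f2.children)
```

Because f2 cannot reach C2's create() without first draining C1's earlier message from the same queue, a getChildren() issued by C2 afterwards sees both children.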
>>> -----Original Message-----
>>> From: daidong [mailto:]
>>> Sent: Wednesday, April 20, 2011 11:31 PM
>>> To: [hidden email]
>>> Subject: Re: RE: Problems about Zab protocol
>>> Hi, Alex
>>> Thanks for your reply. :)
>>> I knew ZAB has two modes, but the things I do not quite understand
>>> concern the broadcast mode. In the ZAB paper, the authors say ZAB is a
>>> simple version of the two-phase commit protocol because there are no
>>> abort actions in followers. I do not quite understand this.
>>> In my opinion, an atomic broadcast protocol must guarantee all the
>>> non-faulty servers have the same status eventually. So in the 2PC
>>> protocol, the coordinator must block until "all" the servers reply
>>> "ok". If there is no abort either, consider the situation where we
>>> have a very slow follower F who processes messages more slowly than
>>> the other followers.
>>> According to TCP and the FIFO channel, we can say all the messages will be

