|
Alexander Shraer
2011-03-10, 07:19
Vishal Kher
2011-03-13, 17:25
Mahadev Konar
2011-03-14, 06:40
|
-
dynamic membership (reconfiguration) in ZookeeperAlexander Shraer 2011-03-10, 07:19
Hi,
I'm working on adding support for dynamic membership changes in Zookeeper clusters (ZOOKEEPER-107). I've added a proposed algorithm and some discussion here: https://cwiki.apache.org/confluence/display/ZOOKEEPER/ClusterMembership Any comments and suggestions are very welcome. Best Regards, Alex
-
Re: dynamic membership (reconfiguration) in ZookeeperVishal Kher 2011-03-13, 17:25
Hi Alex,
Great! Thanks for the design. I have a few suggestions/questions below. A. Bootstrapping the cluster The algorithm assumes that a leader exists prior to doing a reconfig. So it looks like the reconfig() API is not intended to use for bootstrapping the cluster. How about we define a API for initializing the cluster? Instead of relying on the current method of setting the initial configuration in the config file, we should probably also have to define a join(M) (or init(M)) API. When a peer receives this request, it will try to connect to the specified peers. During bootstrap peers will connect to each other (as they do now) and elect a leader. B. Storing membership in znode and updating client (a tangential topic to this design). Earlier ZOOKEEPER-107 proposed of using a URI based approach for the clients to fetch server list. I am not opposed to the URI based approach, however, that shouldn't be our only approach. URI based approach requires extra resources (e.g., fault tolerant web service or shared storage for file, etc). In certain environments it may not be feasible to have such a resource. Instead, can we use a mechanism similar to ZooKeeper watchpoints for this? Store the membership information in a znode and let the ZooKeeper server inform the clients of the changed configuration. C. Locating most recent config on servers The URI based approach will be more helpful to servers. For example, if A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes online it won't be able to locate the most recent config. In this case, A can query the URI. Second approach is to ask the leader to try to periodically send the membership (at least to nodes that are down). D. "Send a message to all members(M’) instructing them to connect to leader(M)." leader(M) can potentially change after sending this message. Should this be "Send a messages to all members(M') to connect to members(M)? See point G. below. Also, in step 7, why do you send leader(M') along with <activate-config>message>? E. "3. Submit a reconfig(M’) operation to leader(M)" What if leader(M) fails after receiving the request, but before informing any other peer? How will the administrative client know whether to retry or not? Should s retry if leader fails and should the client API retry if s fails? F. Regarding 3.a and 3.b. The algorithm mentions: "3. Otherwise: a. Designate the most up-to-date server in connected quorum of M' to be leader(M) b. Stop processing new operations, return failure for any further ops received" Why do we need to change the leader if leader(M) is not in M'? How about we let the leader perform the reconfig and at the end of phase 3 (while processing retire) the leader will give up leadership. This will cause a new leader election and one of the peer in M' will become the leader. Similarly, after the second phase, members(M') will stop following leader(M) if leader(M) is not in M'. I think this will be simpler. G. Online VS offline In your "Online vs offline" section, you mentioned that the offline strategy is preferred. I think we can do reconfiguration online. I pictured M' as modified OBSERVERS till the time the reconfiguration is complete. a. Let new members(M') join the cluster as OBSERVERS. Based on the current implementation, M' will essentially sync with the leader after becoming OBSERVERS and it will be easier to leverage the RequestProcessor pipeline for reconfig. b. Once a majority of M' join the leader, the leader executes phase 1 as you described. c. After phase 1., the leader changes the transaction logic a bit. Every transaction after this point (including reconfig-start) will be sent to M and M'. Leader then waits for quorum(M) and quorum(M') to ack the transaction. So M' are not pure OBSERVERS as we think today. However, only a member of M can become a leader until reconfig succeeds. Also, M' - (M n M') cannot serve client requests until reconfiguration is complete. By doing a transaction on both M and M' and waiting for the quorum of each set to ack, we keep transfering the state to both the configurations. d. After receiving ack for phase 2, the leader sends <switch-to-new-config> transaction to M and M' (instead of sending retire just to M). After receiving this message, M' will close connections with (and reject connections from) members not in M'. Members that are supposed to leave the cluster will shutdown QuorumPeer. If leader(M) is not in M', then a new leader(M') will be elected. Let me know what you think about this. H. What happens if a majority of M' keep failing during reconfig? M={A, B}, M'={A, C, D, E}. What if D, E fail? Failure of a majority of M' will permanently stall reconfig. While this is less likely to happen, I think ZooKeeper should handle this automatically. After a few retries, we should abort reconfig. Otherwise, we could disrupt a running cluster and we will never be able to recover without manual intervention. If the majority fails after phase 1, then this would mean sending a <abort-reconfig, version(M), M'> to M. Of course, one can argue - what if majority of M' fail after phase 3? So not sure if this is an overkill, but I feel we should handle this case. I. "Old reconfiguration request" a. We should use ZAB b. A side note - I think ZOOKEEPER-335 needs to be resolved for reconfig to work. This bug causes logs to diverge if ZK leader fails before sending PROPOSALs to followers (see http://www.mail-archive.com/[EMAIL PROTECTED]/msg00403.html). Because of this bug we could run into the following scenario: - A peer B that was leader when reconfig(M') was issued will have reconfig M' in its transaction log. - A peer C that became leader after B's failure, can have reconfig(M'') in its log. - Now, if B and C fail (say both reboot), then the outcome of reconfig will depend on which node takes leadership. If B becomes a leader, then out come is M'. If C becomes
-
Re: dynamic membership (reconfiguration) in ZookeeperMahadev Konar 2011-03-14, 06:40
Vishal/Alex,
Would it be good to have these comments/design on the jira? Its probably better to keep design discussion/comments on the jira. thanks mahadev On Sun, Mar 13, 2011 at 10:25 AM, Vishal Kher <[EMAIL PROTECTED]> wrote: > Hi Alex, > > Great! Thanks for the design. I have a few suggestions/questions below. > > A. Bootstrapping the cluster > The algorithm assumes that a leader exists prior to doing a reconfig. > So it looks like the reconfig() API is not intended to use for bootstrapping > the cluster. How about we define a API for initializing the cluster? > > Instead of relying on the current method of setting the initial > configuration > in the config file, we should probably also have to define a join(M) (or > init(M)) API. When a peer receives this request, it will try to connect to > the > specified peers. During bootstrap peers will connect to each other (as they > do > now) and elect a leader. > > B. Storing membership in znode and updating client (a tangential topic to > this > design). > Earlier ZOOKEEPER-107 proposed of using a URI based approach for the > clients to fetch server list. I am not opposed to the URI based approach, > however, that shouldn't be our only approach. URI based approach requires > extra > resources (e.g., fault tolerant web service or shared storage for file, > etc). > In certain environments it may not be feasible to have such a resource. > Instead, can we use a mechanism similar to ZooKeeper watchpoints for this? > Store the membership information in a znode and let the ZooKeeper server > inform > the clients of the changed configuration. > > C. Locating most recent config on servers > The URI based approach will be more helpful to servers. For example, if > A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes > online it won't be able to locate the most recent config. In this case, A > can > query the URI. Second approach is to ask the leader to try to periodically > send the membership (at least to nodes that are down). > > D. "Send a message to all members(M’) instructing them to connect to > leader(M)." > leader(M) can potentially change after sending this message. Should > this be "Send a messages to all members(M') to connect to members(M)? See > point G. below. > > Also, in step 7, why do you send leader(M') along with > <activate-config>message>? > > E. "3. Submit a reconfig(M’) operation to leader(M)" > What if leader(M) fails after receiving the request, but before > informing any other peer? How will the administrative client know whether to > retry or not? Should s retry if leader fails and should the client API > retry > if s fails? > > F. Regarding 3.a and 3.b. > > The algorithm mentions: > > "3. Otherwise: > a. Designate the most up-to-date server in connected quorum of M' to be > leader(M) > b. Stop processing new operations, return failure for any further ops > received" > > Why do we need to change the leader if leader(M) is not in M'? How about we > let > the leader perform the reconfig and at the end of phase 3 (while processing > retire) the leader will give up leadership. This will cause a new leader > election and one of the peer in M' will become the leader. Similarly, after > the > second phase, members(M') will stop following leader(M) if leader(M) is not > in > M'. I think this will be simpler. > > G. Online VS offline > > In your "Online vs offline" section, you mentioned that the offline > strategy is preferred. I think we can do reconfiguration online. > I pictured M' as modified OBSERVERS till the time the reconfiguration is > complete. > > a. Let new members(M') join the cluster as OBSERVERS. Based on the current > implementation, M' will essentially sync with the leader after becoming > OBSERVERS and it will be easier to leverage the RequestProcessor pipeline > for > reconfig. > > b. Once a majority of M' join the leader, the leader executes phase 1 as you > described. > > c. After phase 1., the leader changes the transaction logic a bit. Every |