Zookeeper >> mail # dev >> dynamic membership (reconfiguration) in Zookeeper

Re: dynamic membership (reconfiguration) in Zookeeper
Hi Alex,

Great! Thanks for the design. I have a few suggestions/questions below.

A. Bootstrapping the cluster
    The algorithm assumes that a leader exists prior to doing a reconfig.
So it looks like the reconfig() API is not intended to be used for bootstrapping
the cluster. How about we define an API for initializing the cluster?

Instead of relying on the current method of setting the initial membership
in the config file, we should probably also define a join(M) (or
init(M)) API. When a peer receives this request, it will try to connect to the
specified peers. During bootstrap, peers will connect to each other (as they
do now) and elect a leader.
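
To make the idea concrete, here is a minimal sketch in plain Java (no real
ZooKeeper APIs; the init(M) name, the quorum rule, and the highest-id election
are assumptions based on the discussion above, not the actual implementation):
each peer records the member list passed to init(M), and once a quorum of those
peers have connected, a leader is chosen.

```java
import java.util.*;

// Toy model of the proposed init(M)/join(M) bootstrap call.
class Bootstrap {
    final Set<Long> members = new TreeSet<>();   // ids passed to init(M)
    final Set<Long> joined  = new TreeSet<>();   // peers that have connected

    // init(M): this peer learns the initial membership and starts connecting.
    void init(Collection<Long> m) { members.addAll(m); }

    // A peer from M has connected to us.
    void peerJoined(long id) {
        if (members.contains(id)) joined.add(id);
    }

    boolean quorumJoined() { return joined.size() > members.size() / 2; }

    // Once a quorum is connected, elect a leader; here simply the highest id,
    // mimicking default leader election when all peers have equal history.
    OptionalLong leader() {
        return quorumJoined()
            ? OptionalLong.of(Collections.max(joined))
            : OptionalLong.empty();
    }
}
```

The point is only that bootstrap needs no pre-existing leader: init(M) gives
every peer the same view of M, and election proceeds once a quorum is up.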

B. Storing membership in a znode and updating clients (a tangential topic to
this discussion)
    Earlier, ZOOKEEPER-107 proposed using a URI-based approach for the
clients to fetch the server list. I am not opposed to the URI-based approach;
however, that shouldn't be our only approach. The URI-based approach requires
additional resources (e.g., a fault-tolerant web service, or shared storage for a file).
In certain environments it may not be feasible to have such a resource.
Instead, can we use a mechanism similar to ZooKeeper watches for this?
Store the membership information in a znode and let the ZooKeeper servers
notify the clients of the changed configuration.
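
As a sketch of the mechanism (plain Java modeling ZooKeeper's one-shot watch
semantics, not the real client API; the class names and the string-encoded
membership are illustrative assumptions): the client reads the membership znode
and leaves a watch; when a reconfig commits, the server fires the watch once,
and the client re-reads and re-registers.

```java
import java.util.*;

// Toy model of a config znode with ZooKeeper-style one-shot watches.
class ConfigZnode {
    private String data;                            // serialized membership
    private final List<Runnable> watches = new ArrayList<>();

    ConfigZnode(String initial) { data = initial; }

    // Like getData with a watch: returns data and registers a one-shot watch.
    synchronized String readAndWatch(Runnable watch) {
        watches.add(watch);
        return data;
    }

    // A reconfig commits: update data, then fire (and clear) all watches.
    synchronized void update(String newData) {
        data = newData;
        List<Runnable> fired = new ArrayList<>(watches);
        watches.clear();
        fired.forEach(Runnable::run);
    }
}

// A client that keeps its view current by re-reading on every notification.
class ConfigClient {
    String view;
    ConfigClient(ConfigZnode znode) { refresh(znode); }
    void refresh(ConfigZnode znode) {
        view = znode.readAndWatch(() -> refresh(znode)); // re-read + re-watch
    }
}
```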

C. Locating the most recent config on servers
    The URI-based approach will be more helpful to servers. For example, if
A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes back
online it won't be able to locate the most recent config. In this case, A can
query the URI. A second approach is to have the leader periodically try to
send the membership (at least to nodes that are down).
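
Either way, a recovering server just needs a rule for picking the newer of two
configs, which a version number (e.g., the zxid of the reconfig that created it)
makes trivial. A minimal sketch (the field and class names are assumptions):

```java
// Membership tagged with the version of the reconfig that created it.
class VersionedConfig {
    final long version;
    final java.util.Set<String> members;

    VersionedConfig(long version, java.util.Set<String> members) {
        this.version = version;
        this.members = members;
    }

    // When A hears a config from the URI or from the leader, keep the newer one.
    static VersionedConfig newer(VersionedConfig a, VersionedConfig b) {
        return a.version >= b.version ? a : b;
    }
}
```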

D. "Send a message to all members(M’) instructing them to connect to leader(M)"
    leader(M) can potentially change after sending this message. Should
this be "Send a message to all members(M') to connect to members(M)"? See
point G below.

Also, in step 7, why do you send leader(M') along with the message?

E. "3. Submit a reconfig(M’) operation to leader(M)"
    What if leader(M) fails after receiving the request, but before
informing any other peer? How will the administrative client know whether to
retry or not?  Should s retry if the leader fails, and should the client API fail
if s fails?

F. Regarding 3.a and 3.b.

The algorithm mentions:

"3. Otherwise:
   a. Designate the most up-to-date server in the connected quorum of M' to be
      the new leader.
   b. Stop processing new operations; return failure for any further ops."

Why do we need to change the leader if leader(M) is not in M'? How about we let
the leader perform the reconfig, and at the end of phase 3 (while processing
retire) the leader will give up leadership. This will cause a new leader
election, and one of the peers in M' will become the leader. Similarly, after the
second phase, members(M') will stop following leader(M) if leader(M) is not in
M'. I think this will be simpler.

G. Online vs. offline

    In your "Online vs offline" section, you mentioned that the offline
strategy is preferred. I think we can do the reconfiguration online.
I pictured M' as modified OBSERVERS until the reconfiguration is complete.

a. Let the new members(M') join the cluster as OBSERVERS. Based on the current
implementation, M' will essentially sync with the leader after becoming
OBSERVERS, and it will be easier to leverage the RequestProcessor pipeline.

b. Once a majority of M' join the leader, the leader executes phase 1 as you
described.

c. After phase 1, the leader changes the transaction logic a bit. Every
transaction after this point (including reconfig-start) will be sent to both M and
M'. The leader then waits for quorum(M) and quorum(M') to ack the transaction. So
M' are not pure OBSERVERS as we think of them today. However, only a member of M can
become a leader until the reconfig succeeds. Also, M' - (M ∩ M') cannot serve
client requests until the reconfiguration is complete. By doing a transaction on
both M and M' and waiting for the quorum of each set to ack, we keep replicating
the state to both configurations.

d. After receiving acks for phase 2, the leader sends a <switch-to-new-config>
transaction to M and M' (instead of sending retire just to M).

After receiving this message, M' will close connections with (and reject
connections from) members not in M'. Members that are supposed to leave the
cluster will shut down their QuorumPeer. If leader(M) is not in M', then a new
leader(M') will be elected.
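
The dual-quorum rule in step c can be stated precisely. A small sketch (plain
Java; the class and method names are assumptions) of the commit test the leader
would apply while the reconfig is in flight:

```java
import java.util.*;

class DualQuorum {
    // During reconfig, a transaction commits only once it has been acked by a
    // majority of the old membership M AND a majority of the new membership M'.
    static boolean committed(Set<String> acks, Set<String> m, Set<String> mPrime) {
        return majorityAcked(acks, m) && majorityAcked(acks, mPrime);
    }

    static boolean majorityAcked(Set<String> acks, Set<String> members) {
        long acked = members.stream().filter(acks::contains).count();
        return acked > members.size() / 2;
    }
}
```

With M={A, B, C} and M'={A, D, E}, acks from {A, B} satisfy quorum(M) but not
quorum(M'), so the transaction must wait; acks from {A, B, D} satisfy both.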

Let me know what you think about this.

H. What happens if a majority of M' keep failing during reconfig?

M={A, B}, M'={A, C, D, E}. What if D, E fail?

Failure of a majority of M' will permanently stall the reconfig. While this is
less likely to happen, I think ZooKeeper should handle this
automatically. After a few retries, we should abort the reconfig. Otherwise, we
could disrupt a running cluster and would never be able to recover without
manual intervention. If the majority fails after phase 1, then this would mean
sending an <abort-reconfig, version(M), M'> to M.

Of course, one can argue: what if a majority of M' fails after phase 3? So I'm not
sure if this is overkill, but I feel we should handle this case.
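
A sketch of the retry-then-abort policy (plain Java; the retry bound and the
"aborted"/"committed" outcomes are assumptions following the suggestion above,
where aborting means broadcasting <abort-reconfig, version(M), M'> to M):

```java
class ReconfigDriver {
    static final int MAX_RETRIES = 3; // assumed bound before giving up

    // One attempt at the current reconfig phase; true iff quorum(M') acked.
    interface Phase { boolean run(); }

    // Retry the phase a bounded number of times; if it never succeeds, abort
    // so the old configuration M keeps running instead of stalling forever.
    static String drive(Phase phase) {
        for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
            if (phase.run()) return "committed";
        }
        return "aborted"; // i.e., send abort-reconfig to M
    }
}
```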

I. "Old reconfiguration request"

a. We should use ZAB.
b. A side note: I think ZOOKEEPER-335 needs to be resolved for reconfig to
work correctly. This bug causes logs to diverge if the ZK leader fails before sending
PROPOSALs to followers (see
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00403.html).

Because of this bug we could run into the following scenario:
- A peer B that was the leader when reconfig(M') was issued will have reconfig(M')
  in its transaction log.
- A peer C that became leader after B's failure can have reconfig(M'') in its log.
- Now, if B and C fail (say both reboot), then the outcome of the reconfig will
  depend on which node takes leadership. If B becomes the leader, then the outcome
  is M'. If C becomes the leader, then the outcome is M''.