Zookeeper, mail # dev - dynamic membership (reconfiguration) in Zookeeper


Re: dynamic membership (reconfiguration) in Zookeeper
Vishal Kher 2011-03-13, 17:25
Hi Alex,

Great! Thanks for the design. I have a few suggestions/questions below.

A. Bootstrapping the cluster
    The algorithm assumes that a leader exists prior to doing a reconfig.
So it looks like the reconfig() API is not intended to be used for bootstrapping
the cluster. How about we define an API for initializing the cluster?

Instead of relying on the current method of setting the initial configuration
in the config file, we should probably also define a join(M) (or init(M)) API.
When a peer receives this request, it will try to connect to the specified
peers. During bootstrap, peers will connect to each other (as they do now)
and elect a leader.
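To make the proposal concrete, here is a minimal sketch of what such an init(M) entry point might look like. The class and method names are hypothetical (nothing like this exists in ZooKeeper today); the actual implementation would open quorum connections and kick off leader election where the comment indicates.

```java
import java.net.InetSocketAddress;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of a bootstrap API: init(M) hands a fresh peer its
 *  initial member set instead of reading it from the static config file. */
class BootstrapSketch {
    private final Set<InetSocketAddress> members = new HashSet<>();
    private boolean bootstrapped = false;

    /** init(M): record the initial membership; valid only once, before a
     *  leader exists. Returns false if the peer was already bootstrapped. */
    synchronized boolean init(Collection<InetSocketAddress> m) {
        if (bootstrapped) {
            return false;
        }
        members.addAll(m);
        // A real implementation would now connect to each peer in M and
        // run leader election, as peers do today during static startup.
        bootstrapped = true;
        return true;
    }

    synchronized Set<InetSocketAddress> currentMembers() {
        return new HashSet<>(members);
    }
}
```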

B. Storing membership in a znode and updating clients (a tangential topic to
this design).
    Earlier, ZOOKEEPER-107 proposed a URI-based approach for clients to fetch
the server list. I am not opposed to the URI-based approach; however, it
shouldn't be our only approach. A URI-based approach requires extra resources
(e.g., a fault-tolerant web service, or shared storage for a file, etc.). In
certain environments it may not be feasible to have such a resource. Instead,
can we use a mechanism similar to ZooKeeper watches for this? Store the
membership information in a znode and let the ZooKeeper server inform the
clients of the changed configuration.
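As a rough illustration of the idea (not ZooKeeper's actual client API), here is an in-memory simulation of a membership znode with ZooKeeper-style one-shot watches: a client reads the config and arms a watch, and a committed reconfig fires the watches once before clearing them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Simulation sketch: membership stored in a "znode"; clients arm one-shot
 *  watches and get notified when the server commits a new configuration. */
class ConfigZnodeSketch {
    private List<String> members;
    private int version = 0;
    private final List<Consumer<List<String>>> watches = new ArrayList<>();

    ConfigZnodeSketch(List<String> initial) {
        members = new ArrayList<>(initial);
    }

    /** Like getData with watch=true: return the current config and arm a
     *  one-shot watch that fires on the next membership change. */
    synchronized List<String> getAndWatch(Consumer<List<String>> watcher) {
        watches.add(watcher);
        return List.copyOf(members);
    }

    /** Server side: a committed reconfig installs the new membership and
     *  fires (then clears) all armed watches -- ZK watches are one-shot. */
    synchronized void setMembers(List<String> newMembers) {
        members = new ArrayList<>(newMembers);
        version++;
        List<Consumer<List<String>>> fired = new ArrayList<>(watches);
        watches.clear();
        for (Consumer<List<String>> w : fired) {
            w.accept(List.copyOf(members));
        }
    }
}
```

A real client would re-read the znode and re-arm the watch inside its watcher, as with any ZooKeeper watch.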

C. Locating the most recent config on servers
    The URI-based approach will be more helpful to servers. For example, if
A was down when M={A, B, C} was changed to M'={A, D, E}, then when A comes
online it won't be able to locate the most recent config. In this case, A can
query the URI. A second approach is to ask the leader to periodically send
the membership (at least to nodes that are down).

D. "Send a message to all members(M’) instructing them to connect to
leader(M)."
    leader(M) can potentially change after sending this message. Should this
be "Send a message to all members(M') to connect to members(M)"? See point G.
below.

Also, in step 7, why do you send leader(M') along with the <activate-config>
message?

E. "3. Submit a reconfig(M’) operation to leader(M)"
    What if leader(M) fails after receiving the request, but before informing
any other peer? How will the administrative client know whether to retry or
not? Should s retry if the leader fails, and should the client API retry if
s fails?

F. Regarding 3.a and 3.b.

The algorithm mentions:

"3. Otherwise:
   a. Designate the most up-to-date server in connected quorum of M' to be
leader(M)
   b. Stop processing new operations, return failure for any further ops
received"

Why do we need to change the leader if leader(M) is not in M'? How about we
let the leader perform the reconfig, and at the end of phase 3 (while
processing retire) the leader will give up leadership. This will cause a new
leader election, and one of the peers in M' will become the leader. Similarly,
after the second phase, members(M') will stop following leader(M) if leader(M)
is not in M'. I think this will be simpler.

G. Online vs. offline

    In your "Online vs offline" section, you mentioned that the offline
strategy is preferred. I think we can do reconfiguration online. I picture M'
as modified OBSERVERS until the reconfiguration is complete.

a. Let the new members(M') join the cluster as OBSERVERS. Based on the current
implementation, M' will essentially sync with the leader after becoming
OBSERVERS, and it will be easier to leverage the RequestProcessor pipeline
for reconfig.

b. Once a majority of M' join the leader, the leader executes phase 1 as you
described.

c. After phase 1, the leader changes the transaction logic a bit. Every
transaction after this point (including reconfig-start) will be sent to both
M and M'. The leader then waits for quorum(M) and quorum(M') to ack the
transaction. So M' are not pure OBSERVERS as we think of them today. However,
only a member of M can become a leader until reconfig succeeds. Also,
M' - (M n M') cannot serve client requests until reconfiguration is complete.
By performing a transaction on both M and M' and waiting for a quorum of each
set to ack, we keep transferring the state to both configurations.

d. After receiving acks for phase 2, the leader sends a <switch-to-new-config>
transaction to M and M' (instead of sending retire just to M).

After receiving this message, M' will close connections with (and reject
connections from) members not in M'. Members that are supposed to leave the
cluster will shut down QuorumPeer. If leader(M) is not in M', then a new
leader(M') will be elected.
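The dual-quorum commit rule in step c can be sketched as follows; this is just the counting logic under the assumption that a transaction commits only when a strict majority of both the old and the new member set have acked it (server names are illustrative strings).

```java
import java.util.Set;

/** Sketch of step c: during reconfig, a transaction commits only when acked
 *  by a quorum of BOTH the old config M and the new config M'. */
class DualQuorumSketch {
    /** True if a strict majority of members appears in the ack set. */
    static boolean quorumAcked(Set<String> acks, Set<String> members) {
        long acked = 0;
        for (String s : members) {
            if (acks.contains(s)) acked++;
        }
        return acked > members.size() / 2;
    }

    /** The leader treats the txn as committed only when both quorums ack. */
    static boolean committed(Set<String> acks, Set<String> m, Set<String> mPrime) {
        return quorumAcked(acks, m) && quorumAcked(acks, mPrime);
    }
}
```

With M={A, B, C} and M'={A, D, E}, acks from {A, B} satisfy quorum(M) but not quorum(M'), so the transaction stays pending until a member of {D, E} also acks.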

Let me know what you think about this.

H. What happens if a majority of M' keeps failing during reconfig?

M={A, B}, M'={A, C, D, E}. What if D and E fail?

Failure of a majority of M' will permanently stall the reconfig. While this
is less likely to happen, I think ZooKeeper should handle it automatically.
After a few retries, we should abort the reconfig. Otherwise, we could
disrupt a running cluster and never be able to recover without manual
intervention. If the majority fails after phase 1, this would mean sending
an <abort-reconfig, version(M), M'> to M.

Of course, one can argue: what if a majority of M' fails after phase 3? So
I'm not sure if this is overkill, but I feel we should handle this case.
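The bounded-retry-then-abort behavior proposed above could look roughly like this; the class and the retry bound are hypothetical, and the abort message name is taken from the discussion.

```java
/** Sketch: bound reconfig retries so a failed quorum(M') cannot stall the
 *  cluster forever. After maxRetries, the leader would broadcast
 *  <abort-reconfig, version(M), M'> to members(M) and resume normal service. */
class ReconfigRetrySketch {
    /** One attempt of a reconfig phase; true means quorum(M') acked. */
    interface Phase {
        boolean tryOnce();
    }

    /** Returns true if the phase succeeded within maxRetries attempts;
     *  false means the reconfig should be aborted. */
    static boolean runWithAbort(Phase phase, int maxRetries) {
        for (int attempt = 0; attempt < maxRetries; attempt++) {
            if (phase.tryOnce()) {
                return true;
            }
        }
        // A real implementation would send <abort-reconfig, version(M), M'>
        // to members(M) here, so the old configuration keeps running.
        return false;
    }
}
```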

I. "Old reconfiguration request"

a. We should use ZAB.
b. A side note - I think ZOOKEEPER-335 needs to be resolved for reconfig to
work. This bug causes logs to diverge if the ZK leader fails before sending
PROPOSALs to followers (see
http://www.mail-archive.com/[EMAIL PROTECTED]/msg00403.html).

Because of this bug we could run into the following scenario:
- A peer B that was leader when reconfig(M') was issued will have reconfig(M')
  in its transaction log.
- A peer C that became leader after B's failure can have reconfig(M'') in its
  log.
- Now, if B and C fail (say both reboot), then the outcome of the reconfig
  will depend on which node takes leadership. If B becomes a leader, then the
  outcome is M'. If C becomes