Here's sergey's replies from jira (modulo some reformatting for email.)
Answers lifted from email also (some fixes + one answer was modified due to
clarification here ).
What is a failure and how do you react to failures? I think the master5
> design needs to spend more effort to considering failure and recovery
> cases. I claim there are 4 types of responses from a networked IO operation
> - two states we normally deal with ack successful, ack failed (nack) and
> unknown due to timeout that succeeded (timeout success) and unknown due to
> timeout that failed (timeout failed). We have historically missed the last
> two cases and they aren't considered in the master5 design.
There are a few considerations. Let me examine if there are other cases
I am assuming the collocated table, which should reduce such cases for
state (probably, if collocated table cannot be written reliably, master
must stop-the-world and fail over).
When RS contacts master to do state update, it errs on the side of caution
- no state update, no open region (or split).
Thus, except for the case of multiple masters running, we can always assume
RS didn't online the region if we don't know about it.
Then, for messages to RS, see "Note on messages"; they are idempotent so
they can always be resent.
1) State update coordination. What is a "state updates from the outside" Do
> RS's initiate splitting on their own? Maybe a picture would help so we can
> figure out if it is similar or different from hbck-master's?
Yes, these are RS messages. They are mentioned in some operation
descriptions in part 2 - opening->opened, closing->closed; splitting, etc.
2) Single point of truth. hbck-master tries to define what single point of
> truth means by defining intended, current, and actual state data with
> durability properties on each kind. What do clients look at who modifies
Sorry, don't understand the question. I mean single source of truth mainly
about what is going on with the region; it is described in design
I like the idea of "intended state", however without more detailed reading
I am not sure how it works for multiple ops e.g. master recovering the
region while the user intends to split it, so the split should be executed
after it's opened.
3) Table record: "if regions is out of date, it should be closed and
> reopened". It is not clear in master5 how regionservers find out that they
> are out of date. Moreover, how do clients talking to those RS's with stale
> versions know they are going to the correct RS especially in the face of RS
> failures due to timeout?
On alter (and startup if failed), master tries to reopen all regions that
are out of date.
Regions that are not opened with either pick up the new version when they
are opened, or (e.g. if they are now Opening with old version) master
discovers they are out of date when they are transitioned to Opened by RS,
and reopens them again.
As for any case of alter on enabled table, there are no guarantees for
To provide these w/o disable/enable (or logical equivalent of coordinating
all close-s and open-s), one would need some form of version-time-travel,
or waiting for versions, or both.
4) region record: transition states. This is really similar to hbck-masters
> current state and intended state. Shouldn't be defined as part of the
> region record?
I mention somewhere that could be done. One thing is that if several paths
are possible between states, it's useful to know which is taken.
But do note that I store user intent separately from what is currently
going on, so they are not exactly similar as far as I see.
5) Note on user operations: the forgetting thing is scary to me – in your
> move split example, what happens if an RS reads state that is forgotten?
I think my description of this might be too vague. State is not forgotten;
previous intent is forgotten. I.e. if user does several operations in order
that conflict (e.g. split and then merge), the first one will be canceled
Also, RS does not read state as a guideline to what needs to be done.
6) table state machine. how do we guarantee clients are writing from the
Can you please elaborate?
7) region state machine. Earlier draft hand splitting and merge cases. Are
As I mention somewhere these could be separate states. I was kind of afraid
of blowing up state machine too much, so I noticed that for split/merge you
anyway store siblings/children, so you can recognize them and for most
purposes different split-merge states are the same as Opened and Closed.
I will add those back, it would make sense.
8) logical interactions: sounds like master5 allows concurrent region and
Yes, parallelism is intended. You can never be sure you have no races but
we should aim for it
master5 is missing disabled/enabled check, that is a mistake.
Part1 operation interactions already cover it:
table disable doesn't ack until all regions are closed (master5 is wrong ).
region opening cannot start if table is already disabling or disabled.
if region is already opening when disable is issued, opening will be
if disable fails to cancel opening, or server opens it first in a race,
region will be opened, and master will issue close immediately after state
update. Given that region is not closed, disable is not complete.
if opening (or closing) times out, master will fence off RS and mark region
as closed. If there was some way of fencing region separately (ZK lease?)
it would be possible to use that.
In any case, until client checks table state before every write, there's no
easy way to prevent writes on disabling table. Writes on disabled table
will not be possible.
On ensuring there's no double assignment due to RS hanging:
The intent is to fence the WAL for region server, the way we do now. One
could also use other mechanism.
Perhaps I could specify it more clearly; I think the problem of making sure
RS is dead is nearl