We're moving a discussion started in HBASE-5487 (
https://issues.apache.org/jira/browse/HBASE-5487) based on stack's
Master5 refers to the design doc sergey posted, and hbck-master refers to
the design doc that I posted. (I'd suggest refer to them by the doc name
as opposed to whomever wrote it).
Here's the questions I posed to sergey:
I have a lot of questions. I'll hit the big questions first. Also would i
be possible to put a version of this up as gdoc so we can point out nits
and places that need minor clarification? (I have a marked up physical copy
version of the doc, would be easier to provide feedback).
What is a failure and how do you react to failures? I think the master5
design needs to spend more effort to considering failure and recovery
cases. I claim there are 4 types of responses from a networked IO operation
- two states we normally deal with ack successful, ack failed (nack) and
unknown due to timeout that succeeded (timeout success) and unknown due to
timeout that failed (timeout failed). We have historically missed the last
two timeout cases or assumed timeout means failure nack. It seems that
master5 makes the same assumptions.
I'm very concerned about what we need to do to invalidate information
cached RS information at clients in the case of hang, and that will violate
the isolation guarantees that we claim to provide. I really want a slice
in-depth failure handling case analysis including a client with cached rs
assignments for move and something more complicated such as split or alter.
I really want more invariant specified for the FSM states. e.g. if a region
is in state X, does it have a row in meta? does have data on the FS? is it
open on another region? is it open on only one region? I think having 8
pages of tables at the back of the master5 doc can be more concise and
precise which will help us get attempt to prove correctness.
1) State update coordination. What is a "state updates from the outside" Do
RS's initiate splitting on their own? Maybe a picture would help so we can
figure out if it is similar or different from hbck-master's?
2) Single point of truth. What is this truth? what the user specficied
actions? what the rs's are reporting? the last state we were confirmed to
be at? hbck-master tries to define what single point of truth means by
defining intended, current, and actual state data with durability
properties on each kind. What do clients look at who modifies what?
3) Table record: "if regions is out of date, it should be closed and
reopened". It is not clear in master5 how regionservers find out that they
are out of date. Moreover, how do clients talking to those RS's with stale
versions know they are going to the correct RS especially in the face of RS
failures due to timeout?
4) region record: transition states. Shouldn't be defined as part of the
region record? (This is really similar to hbck-masters current state and
intended state. )
5) Note on user operations: the forgetting thing is scary to me – in your
move split example, what happens if an RS reads state that is forgotten?
6) table state machine. how do we guarantee clients are not writing to
against out of date region versions? (in hang situations, regions could be
open on multple places – the hung RS and the new RS the region was assigned
to and successfully opened on)
7) region state machine. Earlier draft hand splitting and merge cases. Are
they elided in master5 or are not present any more. How would this get
extended handle jeffrey's distributed log replay/fast write recovery
8) logical interactions: sounds like master5 allows concurrent operations
in specfiic regions and and specfiic table. (e.g. it will allow moves and
splits and merges on the same region). hbck-master (though not fully
documented) only allows certain region transitions when the table is
enabled or if the table is disabled. Are we sure we don't get into race
conditions? What happens if disable gets issued – its possible for someone
to reopens the region and for old clients to continue writing to it even
though it is closed?
nit. 9) "in cursive" mean in italics.
10) The table operations section have tables which I believe are the
actions between FSM states in the table or region fsms. Is this correct?
Can the edges be labeled to describe which steps these transitions
nit: Design Constraints, code should: Have AM logic isolated from the
persistent storage of state.
// I think this should be "abstracted" so we can plug in different
implementations of persistent storage of state.
// Jonathan Hsieh (shay)
// Software Engineer, Cloudera
// [EMAIL PROTECTED]