Re: [hbase-5487] Master design discussion.
Jonathan Hsieh 2013-10-19, 15:59
Responses inline. I'm going to wait for the next doc before I ask more
questions about the master5 specifics. :)

I suggest we start a thread about the hang/double assign/fencing scenario.

Jon.

On Sat, Oct 19, 2013 at 7:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:

> Here are Sergey's replies from JIRA (modulo some reformatting for email).
>
> ----
> Answers lifted from email also (some fixes + one answer was modified due
> to clarification here).
>
>> What is a failure and how do you react to failures? I think the master5
>> design needs to spend more effort considering failure and recovery
>> cases. I claim there are four types of responses to a networked IO
>> operation: two states we normally deal with, ack success and ack failure
>> (nack), and two unknowns, a timeout where the operation succeeded
>> (timeout success) and a timeout where it failed (timeout failure). We
>> have historically missed the last two cases, and they aren't considered
>> in the master5 design.
>
>
> There are a few considerations. Let me examine whether there are cases
> other than these.
> I am assuming the collocated table, which should reduce such cases for
> state (probably, if the collocated table cannot be written reliably, the
> master must stop the world and fail over).
>

For the master, I agree. In the case of a nack it knows it failed and
should fail over; in the case of a timeout it should try to verify and
retry, and if it cannot, it should abdicate. This seems equivalent to the
9x-master's "if we can't get to zk we abort" behavior. We need to guarantee
that the old master and the new master are fenced off from each other to
make sure they don't split-brain.
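
To make the four cases concrete, here's a minimal Java sketch (all names
hypothetical, nothing from either design doc) of why a timeout has to be
treated as an unknown outcome that needs verification rather than as a
plain failure:

import java.io.IOException;
import java.net.SocketTimeoutException;

// Hypothetical sketch: "timeout success" and "timeout failure" look
// identical on the wire, so the only safe reaction is verify-then-retry,
// and abdicate mastership if no progress can be made.
abstract class AssignSketch {
  static final int MAX_RETRIES = 3;

  void sendRegionOpen(String region) throws IOException {
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
      try {
        if (rpcOpenRegion(region)) {
          return;              // ack success
        }
        failover();            // nack: we know it failed
        return;
      } catch (SocketTimeoutException e) {
        // Unknown outcome: verify the actual state before retrying.
        if (verifyRegionOpen(region)) {
          return;              // timeout success: the open really happened
        }
        // timeout failure: safe to retry an idempotent open
      }
    }
    abdicate();                // can't verify or make progress: step down
  }

  abstract boolean rpcOpenRegion(String region) throws IOException;
  abstract boolean verifyRegionOpen(String region);
  abstract void failover();
  abstract void abdicate();
}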

Off list, you suggested WAL fencing, and hbck-master sketched out a
lease-timeout mechanism that provides isolation. Correctness in the face of
hangs and partitions is my biggest concern, and a lot of the questions I'm
asking focus on these scenarios.

Can you point me to where I can read about and consider the WAL fencing?
Do you have thoughts on or a comparison of these approaches? Are there
other techniques?
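
For comparison, a generic fencing-token sketch (purely illustrative -- this
is neither the WAL fencing nor the hbck-master lease design): each new
master gets a strictly increasing epoch, and the shared resource rejects
writes stamped with an older one, so a deposed master waking up from a hang
can't split-brain:

import java.io.IOException;

// Illustrative fencing-token sketch: a deposed master that wakes up after
// a hang still carries its old epoch, so its writes are rejected.
class FencedLog {
  private long highestEpochSeen = 0;

  synchronized void append(long masterEpoch, byte[] entry) throws IOException {
    if (masterEpoch < highestEpochSeen) {
      throw new IOException("fenced: epoch " + masterEpoch
          + " < " + highestEpochSeen);
    }
    // The first write at a new epoch fences off every older master.
    highestEpochSeen = masterEpoch;
    write(entry);
  }

  private void write(byte[] entry) { /* append to durable storage */ }
}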
> When an RS contacts the master to do a state update, it errs on the side
> of caution - no state update, no open region (or split).
>

I'm not sure what "it" in the second part of the sentence means here -- is
this the master or the RS? If it is the master, I think this is what
hbck-master calls an "actual state update".

If "it" means the RS, I'm confused.
> Thus, except for the case of multiple masters running, we can always
> assume RS didn't online the region if we don't know about it.
>

I don't think that is a safe assumption -- I still think double assignment
can and will happen. If an RS hangs long enough for the master to think it
is dead, the master will reassign. If that RS comes back, it could still be
open, handling clients that cached the old location.

> Then, for messages to RS, see "Note on messages"; they are idempotent so
> they can always be resent.
>
>
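
Aside on that idempotence claim: resend-safety needs the receiver to detect
duplicates. A hypothetical sketch of what makes an open message idempotent
on the RS side:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: the handler check-and-sets the region's state, so a
// resent "open region" message becomes a no-op instead of a double open.
class OpenHandlerSketch {
  enum State { OPENING, OPEN }
  private final ConcurrentMap<String, State> regions = new ConcurrentHashMap<>();

  void handleOpenRegion(String region) {
    // putIfAbsent makes the OPENING transition at-most-once; a duplicate
    // delivery sees a non-null previous state and does nothing.
    if (regions.putIfAbsent(region, State.OPENING) == null) {
      doOpen(region);                  // first delivery: do the real work
      regions.put(region, State.OPEN);
    }
  }

  private void doOpen(String region) { /* open stores, replay edits, etc. */ }
}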
>> 1) State update coordination. What is a "state update from the outside"?
>> Do RS's initiate splitting on their own? Maybe a picture would help so we
>> can figure out if it is similar to or different from hbck-master's?
>
>
> Yes, these are RS messages. They are mentioned in some operation
> descriptions in part 2 - opening->opened, closing->closed; splitting, etc.
>
> [will revisit when part 2 posted]
>> 2) Single point of truth. hbck-master tries to define what single point
>> of truth means by defining intended, current, and actual state data with
>> durability properties on each kind. What do clients look at, and who
>> modifies what?
>
>
> Sorry, I don't understand the question. I mean single source of truth
> mainly about what is going on with the region; it is described in the
> design considerations.
> I like the idea of "intended state"; however, without more detailed
> reading I am not sure how it works for multiple ops, e.g. the master
> recovering the region while the user intends to split it, so the split
> should be executed after it's opened.
>
I'm suggesting there are multiple versions of truth since we have multiple
actors. What the master thinks and knows is one kind (hbck-master:
currentState), what the master wants is another (hbck-master:
intendedState), and what the regionservers think is another (hbck-master:
actualState). The master doesn't know exactly what is really going on (it
could be partitioned from some of the RS's that are still actively serving
clients!).
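
To make the three kinds of truth concrete, a tiny sketch (field names
follow the hbck-master terms; the layout is my illustration, not the
doc's):

// Illustrative only: three views of a region's state, with different
// owners and durability. Note there is no authoritative actualState field
// in the master -- only the serving RS knows it; the master sees at best a
// possibly-stale report of it.
class RegionTruthSketch {
  enum State { OFFLINE, OPENING, OPEN, CLOSING, SPLITTING }

  State intendedState;      // what the operator/master wants; durable
  State currentState;       // what the master currently believes
  State lastReportedActual; // last actual-state report from an RS (may be stale)
}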

Rephrase of the question: "What do clients look at, and who modifies the
truth?"
susceptible to the hang/double-assign scenario I described a few questions
earlier).
To provide these w/o disable/enable (or logical equivalent of coordinating

By version time-travel you mean something like this, right?

1. Client gets from region R1 with schema version 1.
2. Client gets from region R2 with version 1.
3. Alter initiated.
4. R1 updated to v2.
5. Client gets from R1 with version 2.
6. Client gets from R2 with version 1 (version time-travel here).
7. R2 updated to v2.

I think we'd want to wait instead of allowing version time-travel, and
pass version info to the client that it then needs to pass along with RS
requests. This would prevent the version time-travel that could potentially
lead to data hazard situations. (Or maybe online alter should not be
allowed to add/remove CFs -- only to modify their settings.)
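
A hypothetical sketch of the "pass version info to the client" idea (none
of these types are real HBase APIs): the client remembers the highest
schema version it has seen and sends it with every request, so a region
still serving the old schema fails fast instead of letting the client
time-travel back to v1:

import java.io.IOException;

// All hypothetical: Region.get() is assumed to throw if the region's
// schema is older than the version the client has already observed.
interface Result { long schemaVersion(); }
interface Region {
  Result get(byte[] row, long minSchemaVersion) throws IOException;
}

class VersionAwareClient {
  private long highestVersionSeen = 0;

  Result get(Region region, byte[] row) throws IOException {
    // The RS rejects the request (client retries/waits) rather than
    // answering with a pre-alter view of the table.
    Result r = region.get(row, highestVersionSeen);
    highestVersionSeen = Math.max(highestVersionSeen, r.schemaVersion());
    return r;
  }
}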
I think that user intent and system intent can probably be placed in the
same place with just a flag to differentiate (which probably gives user
intent priority).
of canceling however, hbck-master would figure out how to recover to the
latest intended state. How this is done in hbck-master is generic, but it's
a little fuzzy on the details currently (where is it an FSM and where is it
a push-down automaton (PDA)).
state -- what are these invariants?  I felt they were different because
they do a different set of IO operations.
You can prove that you won't be affected by races and can eliminate cases
where you are.

Let me rephrase as a few scenarios:
* what happens if multiple operations