-Re: [hbase-5487] Master design discussion.
Lars Hofhansl 2013-10-19, 22:10
Didn't read the spec, yet.
My main gripe currently is that there are too many places holding the state: the fs, .meta., zk, copies in ram at both master and region servers, etc. Some of the problems we've seen were due to the various copy of this state getting out of sync.
Now I'll shut up and read the speed.
Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>Responses inline. I'm going to wait for next doc before I ask more
>questions about the master5 specifics. :)
>I suggest we start a thread about the hang/double assign/fencing scenario.
>On Sat, Oct 19, 2013 at 7:17 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>> Here's sergey's replies from jira (modulo some reformatting for email.)
>> Answers lifted from email also (some fixes + one answer was modified due
>> to clarification here ).
>> What is a failure and how do you react to failures? I think the master5
>>> design needs to spend more effort to considering failure and recovery
>>> cases. I claim there are 4 types of responses from a networked IO operation
>>> - two states we normally deal with ack successful, ack failed (nack) and
>>> unknown due to timeout that succeeded (timeout success) and unknown due to
>>> timeout that failed (timeout failed). We have historically missed the last
>>> two cases and they aren't considered in the master5 design.
>> There are a few considerations. Let me examine if there are other cases
>> than these.
>> I am assuming the collocated table, which should reduce such cases for
>> state (probably, if collocated table cannot be written reliably, master
>> must stop-the-world and fail over).
>For the master, I agree. In the case of nack it knows it failed and should
>failover, in the case of timeouts it should try verify and retry and if it
>cannot it should abdicate. This seems equivalent to the 9x-master's "if we
>can't get to zk we abort" behavior. We need to guarantee that the old
>master and new master are fenced off from each other to make sure they
>don't split brain.
>Off list, you suggested wal fencing and hbck-master sketched out a lease
>timeouts mechanism that provides isolation. Correctness in the face of
>hangs and partitions are my biggest concern, and a lot of the questions I'm
>asking focus on these scenarios.
>Can you point me to where I read and consider the wal fencing? Do you have
>thoughts or a comparison of these approaches? Are there others techniques?
>> When RS contacts master to do state update, it errs on the side of caution
>> - no state update, no open region (or split).
>not sure what "it" in the second part of the sentence means here -- is this
>the master or the rs? If it is the master I this is what hbck-master calls
>an "actual state update".
>If "it" mean RS's, I'm confused.
>> Thus, except for the case of multiple masters running, we can always
>> assume RS didn't online the region if we don't know about it.
>I don't think that is a safe assumption -- I still think double assignment
>can and will happen. If an RS hangs long enough for the master to think it
>is dead, the master would reassign. If that RS came back, it could still
>be open handling clients who cached the old location.
> Then, for messages to RS, see "Note on messages"; they are idempotent so
>> they can always be resent.
>> 1) State update coordination. What is a "state updates from the outside"
>>> Do RS's initiate splitting on their own? Maybe a picture would help so we
>>> can figure out if it is similar or different from hbck-master's?
>> Yes, these are RS messages. They are mentioned in some operation
>> descriptions in part 2 - opening->opened, closing->closed; splitting, etc.
>> [will revisit when part 2 posted]
>> 2) Single point of truth. hbck-master tries to define what single point of
>>> truth means by defining intended, current, and actual state data with
>>> durability properties on each kind. What do clients look at who modifies