-Re: [hbase-5487] Master design discussion.
Sergey Shelukhin 2013-10-22, 23:56
I'll try to update the doc tomorrow before the meetup..
> I suggest we start a thread about the hang/double assign/fencing scenario.
Fencing is already covered by WAL fencing. After WAL has been fenced, RS
cannot write even if it un-hangs, so I think it's not a problem.
> Can you point me to where I read and consider the wal fencing? Do you
>thoughts or a comparison of these approaches? Are there others techniques?
>... We need to guarantee that the old master and new master are fenced off
>from each other to make sure they don't split brain.
Do you mean WAL fencing for RS or master? I meant master fencing off the
That is already implemented, this is what we do now.
slides 20-23 have the description.
Master can also fence off the previous master I guess. Actually I like it
better than ZK lease.
If we fenced off the WAL for previous master then we know it cannot write,
assuming we use system table of course.
For master - do you think the state management/transitions should be
to double-assignment of system region (two active masters)? It seems simpler
to implement outside of state machine. ZK lease could be one option, but
master5 or part1 don't cover master-master conflict.
Do you think they should?
>> When RS contacts master to do state update, it errs on the side of
>> - no state update, no open region (or split).
>not sure what "it" in the second part of the sentence means here -- is this
>the master or the rs? If it is the master I this is what hbck-master calls
>an "actual state update".
> If "it" mean RS's, I'm confused.
RS. RS completes all region open operations, where onlining it is a boolean
(putting it in the map or such), then updates the state via master.
If it does update the state, master would know it. If it cannot, it doesn't
finish onlining the region and closes it instead.
That way, we are sure the region is not opened unless we have received the
opened message at some point.
>I don't think that is a safe assumption -- I still think double assignment
>can and will happen. If an RS hangs long enough for the master to think it
>is dead, the master would reassign. If that RS came back, it could still
>be open handling clients who cached the old location.
Not for writes (see fencing). For reads, it's nigh impossible to prevent
keeping reasonable MTTR and performance.
> I'm suggesting there are multiple version of truth since we have multiple
>actors. What the master thinks and knows is one kind (hbck-master:
>currentState), what the master wants is another (hbck-master:
>and what the regionservers think are another (hbck-mater: actualState).
>master doesn't know exactly what is really going on (it could be
>from some of the RS's that are still actively serving clients!).
> Rephrase of question: "What do clients look at, and who modifies the
Clients look at system table (via master, presumably). Single point of truth
in my view is for sum total state pertaining to the actions on the region.
is modified by both RS and master in some orderly, as well as atomic
(CAS-like) way. So, region operations do not need to take into account what
other actors are doing. E.g. they don't need to check some other nodes in
ZK, separate in-memory state, etc.; ditto for recovery. So, this covers
current and planned "logical state" from hbck-master. As for actual state, I
am not sure it needs to be covered. Actual state is ephemeral and is dealt
with based on notifications/responses from RSes/etc. It is covered in
operation descriptions, I guess.
>> As for any case of alter on enabled table, there are no guarantees for
>>clients. To provide these w/o disable/enable (or logical equivalent of
>>coordinating all close-s and open-s), one would need some form of
>>version-time-travel, or waiting for versions, or both.
>By version time travel you mean something like this right?
Actually that is what I meant by "version time travel" :) Client reads
previous versions of the regions. That is possible, but again orthogonal
to state or version management.
It is also not so simple to do - it's distributed MVCC, essentially. Plus, I
am not 100% sure of all the ways descriptor can affect client, and what
the support in HRegion will look like - will it be as simple as keeping
descriptors by version, or will there be larger changes.
It is compatible with the update scheme proposed, anyway.
Current model for online schema changes is that client just reads whatever
There are two different levels of intent - the target of the already started
operation, and the next operation, e.g. opening region here and then user
wants to merge it. Unless I misunderstand target states, it would only be
to store one - either "Opened" or "Merging", right?
Imho you need to store both things, opening now, merge then (may not be able
to merge before opening). Or will master reconstruct the path to final
Basically the region splitting/merging is still opened and can be read. From
client perspective it's opened; from master perspective it can do the same
of ops on it after canceling the split/merge activity. E.g. master move
the splitting region to closing just as well as opened, and split will fail
update state in the end and be abandoned. It's more of a notification
of something cooking for the opened region (or closed in case of some of the
Split will change the state of the region to splitting and create daughters,
merge will change the state of the regions to merging, and create daughter.
Given CAS semantics (MCAS actually, using multi-row tx-es), one of the state
changes will fail (on low level). Then, it might decide to cancel the other
operation (say, RS is splitting and the user wants to merge).
Then, it will change the