-
Hbase Assignments in trunk.
Jonathan Hsieh 2012-09-05, 19:38
I generally think in pictures, so yesterday I mapped out the control flow of a single assignment as found in trunk, in terms of threads and network communications (each of which can possibly fail). It is a process that has 18 or so network communications, 3 processes, and about 8 threads coordinating (excluding meta writes).
I wanted to put this out because we've had some discussions about simplifying it or making it more accessible so we can comfortably assess patches. It could possibly serve as a rough design doc or as a counter to potential new strawman designs. For me at least it would be useful when reviewing patches in this area.
We've also talked about defining design and code invariants -- here's the one that I've gotten so far: (We can pull up more from discussion)
* ZK state should be transient (treat it like memory). If deleted, hbase should be able to recover and essentially be in the same state (a few exceptions -- enabled/disabled state)
A few questions I have from this exercise:
1) Why do we have ZK asynchronously update the HM? (why not do it synchronously?)
2) Why do we have the RS update ZK as it opens -- why not have the HM manage all ZK comms and not have the RS talk directly to ZK in this process? Then ZK is just for failover and less so for coordination.
3) Clients who issue assign calls are partially asynchronous and partially synchronous. Why not go all the way?
4) Why are there multiple error conventions -- abort, FAILED_OPEN, throwing an exception (and cases where we "return" silently without notification)?
5) How do we handle timeout situations -- IMO it makes sense to have a rollback or fail-forward policy for different places on the timeline.
6) Can we use cancellation instead of checking for enabling/disabled/disabling/shutdown/stopping all over the place? (let's say these cluster ops would cancel the assign and then win by blocking assigns).
7) In-memory state has different but similarly named states in the HM, in ZK, and in the RSs. And transition events could be missed.
8) Is having multiple processes "responsible for acting" necessary? (why not have the HM open and then update meta)?
Thoughts? (and corrections please!)
Jon. -- // Jonathan Hsieh (shay) // Software Engineer, Cloudera // [EMAIL PROTECTED]
-
Re: Hbase Assignments in trunk.
Stack 2012-09-05, 23:08
On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> I generally think in pictures, so I've mapped out the single Assignment
> control flow as found in trunk yesterday in terms of threads and network
> communications (each of which can possibly fail). It is a process that has
> 18 or so network communications, 3 processes, and about 8 threads
> coordinating (excluding meta writes)
Did you attach your picture Jon?
> We've also talked about defining design and code invariants -- here's the
> one that I've gotten so far: (We can pull up more from discussion)
>
> * ZK state should transient (treat it like memory). If deleted, hbase should
> be able to recover and essentially be in the same state (a few exceptions --
> enabled/disable state)
Yes.
We should post these invariants somewhere? In dev section of refguide?
> A few questions I have from this exercise:
>
> 1) Why do we have ZK asynchronously update the HM? (why not do it
> synchronously?)
IIRC, it was faster.
> 2) Why do we have the RS update ZK as it opens -- why not have the HM manage
> all ZK comms and not have the RS talk directly to ZK in this process? Then
> ZK is just for failover and less so for coordination.
IIRC, the notion was that we could keep an eye on the regionserver's progress opening a region. The RS could take a long time opening, and as long as it was tickling zk by resetting state, the master would not take control of the region away from the RS. Conversely, if the RS froze mid-open, it would know it had lost control if, when it tried to set state, the sequence id had moved on from what it thought it was.
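The ownership check described here can be sketched with a toy in-memory node. This is only a model under stated assumptions: the class `VersionedZnodeSketch` and its methods are hypothetical, and the "sequence id" is assumed to behave like a per-node version that increases on every write. The real ZooKeeper client offers a comparable conditional write through `setData(path, data, version)`, which fails with a bad-version error when the version has moved on.

```java
// Toy in-memory stand-in for a znode with a version counter.
// This is NOT the ZooKeeper client API; it only models a version-checked write.
class VersionedZnodeSketch {
    private String state;
    private int version = 0;

    VersionedZnodeSketch(String initialState) {
        this.state = initialState;
    }

    synchronized int version() { return version; }
    synchronized String state() { return state; }

    // Writes newState only if expectedVersion matches the current version.
    // Returns the new version on success, or -1 if the version had moved on.
    synchronized int setIfVersion(String newState, int expectedVersion) {
        if (expectedVersion != version) {
            return -1; // someone else wrote in between: the caller lost ownership
        }
        state = newState;
        version++;
        return version;
    }

    public static void main(String[] args) {
        VersionedZnodeSketch znode = new VersionedZnodeSketch("OFFLINE");

        // The regionserver claims the region, then "tickles" the node while opening.
        int rsVersion = znode.setIfVersion("OPENING", 0);
        rsVersion = znode.setIfVersion("OPENING", rsVersion);

        // The regionserver stalls; the master times out and takes the region back.
        znode.setIfVersion("OFFLINE", znode.version());

        // The regionserver wakes up and tries to finish with its stale version.
        int result = znode.setIfVersion("OPENED", rsVersion);
        System.out.println(result == -1 ? "RS lost control of region" : "RS opened region");
        System.out.println("final state: " + znode.state());
    }
}
```

The point of the sketch is that the stalled regionserver needs no separate notification: its next write fails, and that failure is the signal that the master has moved on.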
> 3) Clients who issue assign calls are partially asynchronous and partially
> synchronous. Why not go all the way?
No reason. The thought was async meant less friction. The work was just never done to async it all.
> 4) Why are there multiple error conventions -- abort, FAILED_OPEN, throwing
> exception, (and cases where we "return" silently without notification)?
I would have to look at the particular instance, but at a high level I'd say it's a case of:
1. On the one hand, your classic myopic patch-centric view.
2. On the other, you can't throw an exception out to the master if the rpc open has been successfully handed off and the rpc has completed... there needs to be another means of flagging the error.
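One possible shape for "another means of flagging the error" is to make the outcome a value that arrives after the RPC has returned. The sketch below is an illustration only, not how HBase does it: `OpenOutcomeSketch`, `requestOpen`, and the `regionDataReadable` flag are hypothetical names, and `OPENED` is assumed here as the counterpart of the `FAILED_OPEN` convention named in the thread.

```java
import java.util.concurrent.CompletableFuture;

class OpenOutcomeSketch {
    // FAILED_OPEN is named in the thread; OPENED is its assumed counterpart here.
    enum Outcome { OPENED, FAILED_OPEN }

    // Hypothetical RPC: returns immediately with a future, while the open
    // itself runs on another thread and reports its result later.
    static CompletableFuture<Outcome> requestOpen(boolean regionDataReadable) {
        return CompletableFuture.supplyAsync(
            () -> regionDataReadable ? Outcome.OPENED : Outcome.FAILED_OPEN);
    }

    public static void main(String[] args) {
        CompletableFuture<Outcome> pending = requestOpen(false);
        // The call above did not throw; the caller is free to do other work here.
        Outcome outcome = pending.join(); // block only when the result is needed
        System.out.println("open finished with: " + outcome);
    }
}
```

With a single result type like this, a failure after hand-off is reported the same way every time, rather than by a mix of aborts, exceptions, and silent returns.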
> 5) How do we handle timeout situations -- IMO it makes sense to have a
> rollback or fail forward policy for different places on the timeline.
Yes. There are a couple of flavors of this in the code base at present. Could do w/ a revisit for sure.
> 6) Can we use cancellation instead of checking for
> enabling/disabled/disabling/shutdown/stopping all over the place? (let's say
> these cluster ops would cancel the assign and then win by blocking assigns).
The enabling, etc., checks are done on assign to make sure we don't go ahead if table state has changed since the order to assign was given.
To me cancel seems like something else; the open or close has gone out already and we want to stop it happening.
They seem like different things to me.
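The two mechanisms being contrasted here, a state check before acting versus cancelling work that is already in flight, might be sketched as follows. Everything in this block is hypothetical (`AssignCancellationSketch`, `AssignTask`, `startAssign`, and the simplified `TableState` enum); it is meant only to show where each mechanism takes effect, not how the master is implemented.

```java
import java.util.concurrent.atomic.AtomicBoolean;

class AssignCancellationSketch {
    // Hypothetical table states, loosely named after those in the thread.
    enum TableState { ENABLED, DISABLING, DISABLED }

    // Hypothetical handle for one in-flight assignment.
    static final class AssignTask {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        private int stepsDone = 0;

        void cancel() { cancelled.set(true); }
        int stepsDone() { return stepsDone; }

        // Runs one step of the assignment; refuses once the task is cancelled.
        boolean runStep() {
            if (cancelled.get()) {
                return false;
            }
            stepsDone++;
            return true;
        }
    }

    // Check style: refuse to start if the table state no longer allows assigns.
    static AssignTask startAssign(TableState tableState) {
        return tableState == TableState.ENABLED ? new AssignTask() : null;
    }

    public static void main(String[] args) {
        // The check stops an assign that has not started yet.
        System.out.println("started on DISABLING table: "
            + (startAssign(TableState.DISABLING) != null));

        // Cancellation stops an assign that is already in flight.
        AssignTask task = startAssign(TableState.ENABLED);
        task.runStep();
        task.cancel(); // e.g. a disable request arrives mid-assign
        System.out.println("step ran after cancel: " + task.runStep());
        System.out.println("steps completed: " + task.stepsDone());
    }
}
```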
> 7) In memory state has different but similarly named states in the HM, ZK,
> and in the RS's. And there are the transition events could be missed.
Yes. This is a problem.
My peeve is the one where we cannot trust what RegionState says, and even if we could, its states are not 'clean'; e.g. OFFLINE is both the BEGIN state for the open of a region and a catchall parking state that we put regions into when not sure what else to do w/ them.
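One way to picture "clean" states is a table of allowed transitions in which each state has exactly one meaning. The sketch below is an assumption-laden illustration, not the actual RegionState class: the state set is simplified, closing is omitted, and `UNKNOWN` is a hypothetical separate parking state introduced so that `OFFLINE` only ever means "about to be opened".

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

class RegionStateMachineSketch {
    // Hypothetical, simplified states. UNKNOWN is the separate parking state.
    enum State { UNKNOWN, OFFLINE, OPENING, OPEN, FAILED_OPEN }

    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.UNKNOWN, EnumSet.of(State.OFFLINE));
        ALLOWED.put(State.OFFLINE, EnumSet.of(State.OPENING));
        ALLOWED.put(State.OPENING, EnumSet.of(State.OPEN, State.FAILED_OPEN));
        ALLOWED.put(State.OPEN, EnumSet.noneOf(State.class)); // closing omitted in this sketch
        ALLOWED.put(State.FAILED_OPEN, EnumSet.of(State.OFFLINE, State.UNKNOWN));
    }

    private State current = State.OFFLINE;

    State current() { return current; }

    // Applies the transition only if the table allows it; otherwise leaves state untouched.
    boolean transition(State next) {
        if (!ALLOWED.get(current).contains(next)) {
            return false;
        }
        current = next;
        return true;
    }

    public static void main(String[] args) {
        RegionStateMachineSketch region = new RegionStateMachineSketch();
        System.out.println("OFFLINE -> OPEN allowed: " + region.transition(State.OPEN));
        System.out.println("OFFLINE -> OPENING allowed: " + region.transition(State.OPENING));
        System.out.println("OPENING -> FAILED_OPEN allowed: " + region.transition(State.FAILED_OPEN));
        System.out.println("FAILED_OPEN -> UNKNOWN allowed: " + region.transition(State.UNKNOWN));
        System.out.println("current: " + region.current());
    }
}
```

Because skipped or illegal transitions are rejected in one place, a missed event shows up as a refused transition instead of silently leaving the state wrong.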
> 8) Is having multiple processes "responsible for acting" necessary? (why
> not have the HM open and then update meta)?
It could be good having master do all meta edits. Would be good to see what advantage it would bring us before going about making the change.
I can provide more history and provenance if needed, np.
St.Ack
-
Re: Hbase Assignments in trunk.
Jonathan Hsieh 2012-09-06, 00:07
On Wed, Sep 5, 2012 at 4:08 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > I generally think in pictures, so I've mapped out the single Assignment
> > control flow as found in trunk yesterday in terms of threads and network
> > communications (each of which can possibly fail). It is a process that has
> > 18 or so network communications, 3 processes, and about 8 threads
> > coordinating (excluding meta writes)
>
> Did you attach your picture Jon?
I attached a 571k pdf. If it didn't get through, I'll post it somewhere so folks can see it.
-
Re: Hbase Assignments in trunk.
Jonathan Hsieh 2012-09-06, 00:17
Here's a link to the pdf/picture.
http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
Jon.
-
Re: Hbase Assignments in trunk.
n keywal 2012-09-06, 07:53
On async vs. sync: there are 3 different ways to write multiple znodes in ZK, with huge differences in performance between them:
1) for loop sync
2) for loop async
3) multi
Async will be 20 to 100 times faster than sync. Multi will be 2 to 4 times faster than async (that is, 40 to 400 times faster than sync).
Multi was not available before ZK 3.4. It has several obvious advantages over async imho: it's faster, it's synchronous, and it's a transaction. That usually simplifies the user code.
It has other advantages:
- async and sync will typically send one or more packets per znode (Nagle is not activated iirc), while there will be only a few packets for all the znodes with multi
- you can expect async to write multiple times to disk, while multi should write only once. This is usually better for i/o performance.
In a serious recovery situation, with all the regions moving all over the place, saving disk and network i/o for ZooKeeper is important.
Disadvantage: it's new.
On Thu, Sep 6, 2012 at 7:49 AM, Stack <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 5, 2012 at 5:17 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> > Here's a link to the pdf/picture.
> >
> > http://people.apache.org/~jmhsieh/hbase/120905-hbase-assignment.pdf
>
> Pretty picture. Not a pretty story.
>
> What you thinking?
>
> St.Ack
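The "it's a transaction" property of multi described in this message can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the ZooKeeper client: `MultiWriteSketch` and `SetOp` are hypothetical, the request counter is only a rough proxy for network cost, and each path is assumed to appear at most once per batch. In the real client the batched call is `ZooKeeper.multi`, which takes a list of `Op` objects.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MultiWriteSketch {
    // One conditional write: set value at path if the node is at expectedVersion.
    static final class SetOp {
        final String path;
        final String value;
        final int expectedVersion;

        SetOp(String path, String value, int expectedVersion) {
            this.path = path;
            this.value = value;
            this.expectedVersion = expectedVersion;
        }
    }

    private final Map<String, String> data = new HashMap<>();
    private final Map<String, Integer> versions = new HashMap<>();
    int requests = 0; // client-to-server requests, as a rough cost proxy

    void create(String path, String value) {
        data.put(path, value);
        versions.put(path, 0);
    }

    String get(String path) { return data.get(path); }

    private boolean canApply(SetOp op) {
        Integer v = versions.get(op.path);
        return v != null && v == op.expectedVersion;
    }

    private void apply(SetOp op) {
        data.put(op.path, op.value);
        versions.put(op.path, op.expectedVersion + 1);
    }

    // One request per znode; earlier writes stay applied if a later one fails.
    int writeOneByOne(List<SetOp> ops) {
        int applied = 0;
        for (SetOp op : ops) {
            requests++;
            if (!canApply(op)) {
                break;
            }
            apply(op);
            applied++;
        }
        return applied;
    }

    // One request for the whole batch; nothing is applied unless every op can be.
    boolean multi(List<SetOp> ops) {
        requests++;
        for (SetOp op : ops) {
            if (!canApply(op)) {
                return false;
            }
        }
        for (SetOp op : ops) {
            apply(op);
        }
        return true;
    }

    public static void main(String[] args) {
        MultiWriteSketch zk = new MultiWriteSketch();
        zk.create("/region/a", "OFFLINE");
        zk.create("/region/b", "OFFLINE");
        List<SetOp> ops = Arrays.asList(
            new SetOp("/region/a", "OPENING", 0),
            new SetOp("/region/b", "OPENING", 7)); // stale version: cannot apply

        System.out.println("multi applied: " + zk.multi(ops) + ", a=" + zk.get("/region/a"));
        System.out.println("one-by-one applied: " + zk.writeOneByOne(ops) + ", a=" + zk.get("/region/a"));
        System.out.println("requests sent: " + zk.requests);
    }
}
```

The batch leaves both nodes untouched when one op is stale, whereas the loop leaves the first node changed and the second not, which is the partial state the caller would otherwise have to clean up.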
-
Re: Hbase Assignments in trunk.
Jonathan Hsieh 2012-09-06, 10:16
On Wed, Sep 5, 2012 at 4:08 PM, Stack <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> ...
> > We've also talked about defining design and code invariants -- here's the
> > one that I've gotten so far: (We can pull up more from discussion)
> >
> > * ZK state should transient (treat it like memory). If deleted, hbase should
> > be able to recover and essentially be in the same state (a few exceptions --
> > enabled/disable state)
>
> Yes.
>
> We should post these invariants somewhere? In dev section of refguide?
We should definitely put this in the javadoc. Maybe we should have a dev-guide section of the ref-guide where some of these things are also captured?
> > 4) Why are there multiple error conventions -- abort, FAILED_OPEN, throwing
> > exception, (and cases where we "return" silently without notification)?
>
> I would have to look at the particular instance but high level I'd say
> its a case of:
>
> 1. On the one hand your classic myopic patch-centric view
> 2. While on the other, you can't throw an exception out to the master
> if the rpc open has been successfully handed off and the rpc has
> completed... there needs to be another means flagging error.
From a code-craft point of view, failure behavior is buried deeply and could be pulled out to the process methods of the handlers. In many cases, it isn't easy to figure out why one behavior is chosen over the others.
> > 5) How do we handle timeout situations -- IMO it makes sense to have a
> > rollback or fail forward policy for different places on the timeline.
>
> Yes. There are a couple of flavors of this in the code base at
> present. Could do w/ a revisit for sure.
This is more of a question -- I'm not familiar with the details of rpc timeouts currently.
> > 6) Can we use cancellation instead of checking for
> > enabling/disabled/disabling/shutdown/stopping all over the place? (let's say
> > these cluster ops would cancel the assign and then win by blocking assigns).
>
> The enabling, etc., checks are done on assign to make sure we don't go
> ahead if table state has changed since the order to assign was given.
>
> To me cancel seems like something else; the open or close has gone out
> already and we want to stop it happening.
>
> They seem like different things to me.
I'm suggesting that when an overriding operation like enable/disable/shutdown/stop is triggered, we internally use cancellation to preempt assignments/unassignments. This could be in the same places where we currently do the checks, but could also eventually be used to cancel open/close operations. Maybe this is too far out for the time being.
> > 7) In memory state has different but similarly named states in the HM, ZK,
> > and in the RS's. And there are the transition events could be missed.
>
> Yes. This is a problem.
>
> My peeve is the one where we cannot trust what RegionState says and
> even if we could, its states are not 'clean'; e.g. OFFLINE is both
> BEGIN the open of a region but also a catchall parking state that we
> put regions into when not sure what else to do w/ them.
There is the state name (I agree). Also, there is the fact that RegionState is not always right (possibly more than one state transition behind). This is actually why I was considering taking the zk-based control flow elements and putting them in the master. If states are skipped, we need to make sure the transitions happen on the master (or that we can safely skip the transition).
I'm also suggesting that we could avoid using ZK event callbacks like the OPENING and OPENED zk transitions and instead have the master manage those. An opening RS would tickle some other znode to show progress. At least then RegionState would be closer to accurate, and the HM would go through all state transitions.
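A rough sketch of this proposal, with the master owning the tracking and the regionserver only reporting progress, could look like the following. All names are hypothetical (`OpenProgressSketch`, `startOpen`, `tickle`, `isStalled`, `abandon`), the progress znode is modeled as a plain map entry, and timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

class OpenProgressSketch {
    private final long timeoutMillis;
    // Region name -> time of the last progress report. Stands in for a progress znode.
    private final Map<String, Long> lastProgress = new HashMap<>();

    OpenProgressSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    // Master records that it asked a regionserver to open the region.
    void startOpen(String region, long nowMillis) {
        lastProgress.put(region, nowMillis);
    }

    // Regionserver reports progress; rejected if the master no longer tracks this open.
    boolean tickle(String region, long nowMillis) {
        if (!lastProgress.containsKey(region)) {
            return false;
        }
        lastProgress.put(region, nowMillis);
        return true;
    }

    // Master-side check: has the open been quiet for longer than the timeout?
    boolean isStalled(String region, long nowMillis) {
        Long last = lastProgress.get(region);
        return last != null && nowMillis - last > timeoutMillis;
    }

    // Master gives up on the open and stops tracking it, so it can reassign.
    void abandon(String region) {
        lastProgress.remove(region);
    }

    public static void main(String[] args) {
        OpenProgressSketch master = new OpenProgressSketch(1000);
        master.startOpen("region-1", 0);
        master.tickle("region-1", 800);
        System.out.println("stalled at t=1500: " + master.isStalled("region-1", 1500));
        System.out.println("stalled at t=2000: " + master.isStalled("region-1", 2000));
        master.abandon("region-1");
        System.out.println("late tickle accepted: " + master.tickle("region-1", 2100));
    }
}
```

In this shape the master decides every transition itself, so its view of the region cannot fall behind a callback it never received; the cost is that the master has to poll or schedule the stall check.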
> > 8) Is having multiple processes "responsible for acting" necessary? (why
> > not have the HM open and then update meta)?
I'm pretty sure it would have more latency. Controlling when an assigned region becomes available might make this trickier. (Jimmy caught a bug in an earlier version of this).
-
Re: Hbase Assignments in trunk.
Jonathan Hsieh 2012-09-06, 10:20
IMO, moving to the new ZK seems to make sense for HBase trunk.
Jon.
-
Re: Hbase Assignments in trunk.
Dave Wang 2012-09-06, 14:36
There's a discussion on the ZK mailing list about releasing ZK 3.4.4, which will have multi and some other fixes. Once that is out, we can move to that on trunk. That will also help with one of the replication patches that Himanshu currently has pending, which relies on multi.
- Dave
-
Re: Hbase Assignments in trunk.
Stack 2012-09-06, 20:28
On Thu, Sep 6, 2012 at 3:16 AM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
> On Wed, Sep 5, 2012 at 4:08 PM, Stack <[EMAIL PROTECTED]> wrote:
>> On Wed, Sep 5, 2012 at 12:38 PM, Jonathan Hsieh <[EMAIL PROTECTED]> wrote:
>> ...
>> We should post these invariants somewhere? In dev section of refguide?
>
> We should definitely put this in the javadoc. Maybe we should have a
> dev-guide section of the ref-guide where some of these things are also
> captured?
I added an invariants section to the developer pages. I used your wording of the zk data axiom above.
(What other invariants do we have?)
> On a code craft point of view, failure behavior is buried deeply and could
> be pulled out to the process methods of the handlers. In many cases, it
> isn't easy to figure out why one behavior is chosen vs others.
Nod.
> I'm also suggesting that we could avoid using ZK event callbacks like the
> OPENING and OPENED zk transition and instead have the master would manage
> those. We'd have an opening RS would tickle some other znode to show
> progress. At least then RegionState would be closer to accurate, and the
> HM would go through all state transitions.
Perhaps.
I would look at any prospective design to see if I could see holes where master and regionserver might diverge in terms of what they think a particular region's state is at any one time (Up to this, they've done it via the znode proxy that one or the other purportedly owns outright at any time; there is even some facility for progressing in the face of missed callbacks though for sure we are now into a gray area).
St.Ack
-
Re: Hbase Assignments in trunk.
lars hofhansl 2012-09-10, 22:29
I said a while ago that we should require ZK 3.4.x for 0.96+.
Distributed consensus without a "transaction" option always rang a bit weird to me.
Maybe switch in 0.98+?
-- Lars
-
Re: Hbase Assignments in trunk.
Enis Söztutar 2012-09-11, 02:45
+1 on rethinking the assignment + splitting code paths, and using zk as a transactional database. Just my 2 cents w/o spending a lot of time on the details, but maybe we should stop keeping this metadata in master and RS memory, and instead keep region assignments in zk, with the HM and RS just keeping a consistent in-memory cache.
Enis
-
Re: Hbase Assignments in trunk.
n keywal 2012-09-11, 07:52
Region assignment in ZK could be interesting, as could having the regionserver state available there. This would require some work in ZK, I fear (ZOOKEEPER-1147). However, persisting data in ZK is dangerous: it leads to the cluster state being shared between two references, which makes the whole thing complicated to manage (I'm thinking about snapshots, for example). It should be possible to restart the cluster with an empty ZK, with HBase/HDFS being the single persistent store. And making 3.4+ mandatory for 0.98 seems a good thing to do as well :-).