assignment - is master being a watchdog useful?
Nicolas Liochon 2012-12-05, 19:48
See comments in HBASE-7247: the master tracks how long a regionserver takes to open a region, and assigns the region to another regionserver if the open takes too long. This adds complexity.
from Stack: "I'm currently of the opinion that this expensive facility of master failing an open because it has been taking too long on a particular regionserver has been of no use – worse, it has only caused headache – but I may be just not remembering and others out on dev list will have better recall than I."
So, opinions & memories are more than welcome. Removing this feature would be a huge simplification!
Cheers,
Nicolas
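For readers new to the thread, the watchdog behaviour described above can be pictured roughly as follows. This is a minimal sketch, not HBase code: the class `AssignmentWatchdog`, its methods, and the fallback-server parameter are all hypothetical names chosen for illustration, and time is passed in explicitly so the logic is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a master-side assignment watchdog; not HBase code.
final class AssignmentWatchdog {
    // Region name -> time (ms) at which the master asked a server to open it.
    private final Map<String, Long> openStartedAt = new HashMap<>();
    // Region name -> server most recently asked to open it.
    private final Map<String, String> assignedTo = new HashMap<>();
    private final long timeoutMs;

    AssignmentWatchdog(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void startOpen(String region, String server, long nowMs) {
        openStartedAt.put(region, nowMs);
        assignedTo.put(region, server);
    }

    // The open finished, so the watchdog stops tracking the region.
    void openCompleted(String region) {
        openStartedAt.remove(region);
    }

    String serverFor(String region) {
        return assignedTo.get(region);
    }

    // Periodic check: every open pending longer than the timeout is
    // handed to fallbackServer. Returns the regions that were reassigned.
    List<String> checkTimeouts(long nowMs, String fallbackServer) {
        List<String> timedOut = new ArrayList<>();
        for (Map.Entry<String, Long> e : openStartedAt.entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                timedOut.add(e.getKey());
            }
        }
        // Reassign after the scan so the map is not modified while iterating.
        for (String region : timedOut) {
            startOpen(region, fallbackServer, nowMs);
        }
        return timedOut;
    }

    public static void main(String[] args) {
        AssignmentWatchdog watchdog = new AssignmentWatchdog(10 * 60 * 1000L);
        watchdog.startOpen("region-1", "rsA", 0L);
        System.out.println(watchdog.checkTimeouts(11 * 60 * 1000L, "rsB"));
    }
}
```

The sketch deliberately leaves out the hard part the thread is about: the original server may still be working on the open when the master reassigns it.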
assignment - is master being a watchdog useful?
Andrew Purtell 2012-12-06, 01:20
My information here may be stale.
I remember we increased the timeout interval from 3 to 30 minutes, because the master injecting itself into mid-assignment often triggered races and led to double assignments and other bad stuff. At 30 minutes, this is not useful IMO. As an operator I'd run hbck to sort it out long before then.
Best regards,
- Andy
Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)
Re: assignment - is master being a watchdog useful?
Jimmy Xiang 2012-12-06, 02:57
Ideally, we don't need a watchdog. If we open a region on a region server, the region will be opened there quickly. If the region server dies in the middle, ServerShutdownHandler will take care of it.
If this region server happens to be hot, it may take a while to open the region. If we don't time the open out, the server may get even hotter. On the other hand, if this region server could not open the region, other region servers may not be able to open it either.
By the way, currently, the timeout interval is 10 minutes.
If we can live with the hot region server issue, I don't see why we can't remove it right now.
Thanks, Jimmy
Re: assignment - is master being a watchdog useful?
Stack 2012-12-06, 03:53
On Wed, Dec 5, 2012 at 6:57 PM, Jimmy Xiang <[EMAIL PROTECTED]> wrote:
> If this region server happens to be hot, it may take a while to open
> it. If we don't time it out, the server may be even hotter. If the
> region server could not open it here, other region servers may not be
> able to open it either.

I suppose the master can still 'timeout' the open if the RS is watching the znode for the region it is trying to open. The RS will notice that master has assumed control in a callback and can then cancel any ongoing open.
St.Ack
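The idea of the RS noticing a master takeover through a callback can be sketched as below. This is an illustration only, under loud assumptions: `FakeZNode` is a hypothetical in-memory stand-in and not the ZooKeeper client API, and `RegionOpenTask` is an invented name. Real ZooKeeper watches also fire only once and must be re-registered, which the sketch ignores.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

// Hypothetical in-memory stand-in for a region's assignment znode.
// It is NOT the ZooKeeper API; it only models "owner changed" callbacks.
final class FakeZNode {
    private String owner;
    private final List<Consumer<String>> watchers = new ArrayList<>();

    FakeZNode(String owner) {
        this.owner = owner;
    }

    void watch(Consumer<String> onOwnerChange) {
        watchers.add(onOwnerChange);
    }

    // Changing the owner notifies every registered watcher.
    void setOwner(String newOwner) {
        owner = newOwner;
        for (Consumer<String> w : watchers) {
            w.accept(newOwner);
        }
    }

    String owner() {
        return owner;
    }
}

// Hypothetical region-open task that cancels itself when another party
// (for example the master) takes over the znode.
final class RegionOpenTask {
    private final AtomicBoolean cancelled = new AtomicBoolean(false);

    RegionOpenTask(String serverName, FakeZNode znode) {
        znode.watch(newOwner -> {
            if (!serverName.equals(newOwner)) {
                cancelled.set(true);
            }
        });
    }

    // Runs the open as a series of steps, checking for cancellation before
    // each one. Returns true only if the open ran to completion.
    boolean runSteps(int steps, Runnable step) {
        for (int i = 0; i < steps; i++) {
            if (cancelled.get()) {
                return false;
            }
            step.run();
        }
        return !cancelled.get();
    }

    public static void main(String[] args) {
        FakeZNode znode = new FakeZNode("rsA");
        RegionOpenTask task = new RegionOpenTask("rsA", znode);
        znode.setOwner("master"); // master hijacks the znode before the open starts
        System.out.println(task.runSteps(3, () -> { }));
    }
}
```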
Re: assignment - is master being a watchdog useful?
Jimmy Xiang 2012-12-06, 17:26
Currently, the RS doesn't watch the znode. The RS cancels an ongoing open after the master tells it to.
Jimmy
Re: assignment - is master being a watchdog useful?
Sergey Shelukhin 2012-12-06, 18:18
I may be missing some past context here, but why not make it so that the assignment zookeeper node is ephemeral, so it dies with the server? Then it will be possible to notice there's no more assignment without the separate watcher.
I have conflicting opinions about the current safeguard. On one hand, I've seen at least one bug (HBASE-6060), fixed on 0.96 but explicitly not in 0.94, that resulted in a region never being assigned (until the 30min watcher kicked in, that is). On the other hand, making catch-alls for code bugs in this manner seems like a bad practice. Maybe we can remove it when we have "bulletproof" unit(!) tests for AM that take into account various scenarios.
Re: assignment - is master being a watchdog useful?
Jimmy Xiang 2012-12-06, 18:35
We can't make the assignment znode ephemeral. It is used to track region assignments and for recovery. For example, suppose a region is moving from RS A to RS B, and while it is opening on B, both B and the master die. If the znode is gone with B, then the backup master that takes over will think the region is still open on RS A, since A is live and meta still shows the region on A, which is not the case.
Thanks, Jimmy
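The failover scenario above can be traced with a small model. This is a sketch under stated assumptions, not how HBase recovers state: `FailoverRecovery.resolve` and the string markers `IN_TRANSITION` and `UNASSIGNED` are hypothetical, and the new master is assumed to consult only meta, the surviving assignment znodes, and the set of live servers.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of what a newly active master can infer about a region.
final class FailoverRecovery {
    // If an assignment znode survives, the region is known to be mid-move.
    // Otherwise the master can only trust meta, provided that server is live.
    static String resolve(String region,
                          Map<String, String> meta,
                          Set<String> survivingZnodes,
                          Set<String> liveServers) {
        if (survivingZnodes.contains(region)) {
            return "IN_TRANSITION";
        }
        String server = meta.get(region);
        return (server != null && liveServers.contains(server)) ? server : "UNASSIGNED";
    }

    public static void main(String[] args) {
        // Region r1 was moving from rsA to rsB; rsB and the old master died.
        Map<String, String> meta = new HashMap<>();
        meta.put("r1", "rsA"); // meta was never updated to rsB
        Set<String> live = Collections.singleton("rsA");

        // Persistent znode: it survives rsB's death.
        System.out.println(resolve("r1", meta, Collections.singleton("r1"), live));
        // Ephemeral znode: it vanished together with rsB.
        System.out.println(resolve("r1", meta, Collections.<String>emptySet(), live));
    }
}
```

With the znode gone, the model has no way to tell that the region already left rsA, which is the wrong conclusion described in the message.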
Re: assignment - is master being a watchdog useful?
Stack 2012-12-06, 18:34
On Thu, Dec 6, 2012 at 9:26 AM, Jimmy Xiang <[EMAIL PROTECTED]> wrote:
> Currently, rs doesn't watch the znode. RS cancels ongoing open after
> master tells it so.

Is there a cancel message? Or do you mean master sending a close on a region that is opening will result in our canceling the open?
Thanks Jimmy,
St.Ack
Re: assignment - is master being a watchdog useful?
Jimmy Xiang 2012-12-06, 18:39
The master sending a close on a region that is opening will result in cancelling the open.
As for the TM (TimeoutMonitor), it is really for exceptional cases. In the normal scenario, we don't need it. Before we remove it, should we list all exceptional cases and make sure they are covered?
Thanks, Jimmy
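The "close cancels an in-progress open" behaviour can be modelled as a tiny state machine. This is a hedged sketch, not the region server's implementation: `RegionServerSketch`, its `State` enum, and its method names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of a region server's per-region state; not HBase code.
final class RegionServerSketch {
    enum State { OPENING, OPEN, CLOSED }

    private final Map<String, State> regions = new HashMap<>();

    // The master asks this server to open a region.
    void requestOpen(String region) {
        regions.put(region, State.OPENING);
    }

    // The master asks this server to close a region. If the region is still
    // opening, this is what cancels the open.
    void requestClose(String region) {
        regions.put(region, State.CLOSED);
    }

    // Called when the (possibly slow) open work finishes. The open only takes
    // effect if nobody closed the region in the meantime.
    boolean finishOpen(String region) {
        if (regions.get(region) != State.OPENING) {
            return false;
        }
        regions.put(region, State.OPEN);
        return true;
    }

    State stateOf(String region) {
        return regions.get(region);
    }

    public static void main(String[] args) {
        RegionServerSketch rs = new RegionServerSketch();
        rs.requestOpen("r1");
        rs.requestClose("r1"); // master gives up on this open
        System.out.println(rs.finishOpen("r1") + " " + rs.stateOf("r1"));
    }
}
```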
Re: assignment - is master being a watchdog useful?
Stack 2012-12-06, 18:44
On Thu, Dec 6, 2012 at 10:39 AM, Jimmy Xiang <[EMAIL PROTECTED]> wrote:
> The master sending a close on a region that is opening will result in
> cancelling the open.

So you think we could use this in say the TM instead of taking over the znode and hoping it will notice the hijacking (currently, by trying to transition the znode or alternatively by having a watcher on the znode so it gets a callback when the master takes it over)?

> Before we remove it, should we list all exceptional cases and make
> sure they are covered?

No harm I'd say. Could do it in HBASE-7247?
St.Ack
Re: assignment - is master being a watchdog useful?
Jimmy Xiang 2012-12-06, 18:55
On Thu, Dec 6, 2012 at 10:44 AM, Stack <[EMAIL PROTECTED]> wrote:
> So you think we could use this in say the TM instead of taking over the
> znode and hoping it will notice the hijacking (currently, by trying to
> transition the znode or alternatively by having a watcher on the znode so
> it gets a callback when the master takes it over)?

Currently, the RS does not seem to have a watcher on the znode. It could use a watcher, of course.

>> Before we remove it, should we list all exceptional cases and make
>> sure they are covered?
>
> No harm I'd say. Could do it in HBASE-7247?

OK.
Re: assignment - is master being a watchdog useful?
Andrew Purtell 2012-12-06, 03:53
10 minutes is still too long to be useful IMO.
On 12/6/12, Jimmy Xiang <[EMAIL PROTECTED]> wrote:
> By the way, currently, the timeout interval is 10 minutes.